
How to recode specific categorical values based on their position in the overall sequence of the data


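Every five days I collect data on plant development, or phenology (encoded in the categorical variable "Code"), along a transect divided into 78 consecutive segments. Every species is surveyed in each segment of the transect.

My study repeats a historical study from 100 years ago, and I kept the original phenological coding scheme, but I did not think about how I would analyze the data once the summer was over!

The problem I did not consider while collecting the data is that the codes follow a sequence in which the same code can occur both early and late in the summer. Specifically, the codes are: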

b1 = single flower
b2 = sparse flowers (two or three)
b3 = flowers common (more than three)
b4 = flowering ended

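Following the methods of the original study, the sequence of codes collected over the summer for any flowering plant will look something like b1, b2, b3, b2, b1, b4. Note that we visit the transect every five days, and a code may repeat on consecutive visits, e.g. b1, b1, b2, b2, b2, b2, b3, b3, b3, b2, b2, b1, b4.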
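To see the underlying b1, b2, b3, b2, b1, b4 pattern in a sequence with repeated visits, base R's `rle()` can collapse consecutive repeats into runs. This is only an illustration on the example sequence quoted above; the variable names `visits` and `runs` are made up for this sketch.

```r
# Codes recorded on consecutive visits (the example sequence from the question)
visits <- c("b1", "b1", "b2", "b2", "b2", "b2", "b3", "b3", "b3",
            "b2", "b2", "b1", "b4")

# rle() collapses consecutive repeats into runs
runs <- rle(visits)

runs$values   # the distinct stages in order: "b1" "b2" "b3" "b2" "b1" "b4"
runs$lengths  # number of visits spent in each stage: 2 4 3 2 1 1
```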

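I would like to recode the 'b1' and 'b2' codes as follows (see the example and the sample data):

1. If 'b1' occurs before 'b2' or 'b3', it should become 'b1a'; if it occurs after 'b2' or 'b3', it should become 'b1b'. Note that sometimes there is no 'b2' or 'b3' in the sequence of observations.

2. If 'b2' occurs before 'b3', it should become 'b2a'; if it occurs after 'b3', it should become 'b2b'. If there is no 'b3' at all, 'b2' should become 'b2a'. Keep in mind that there may be several 'b2' observations after the last occurrence of 'b3' (see the example and the sample data).

3. 'b1' and 'b2' may both occur without any 'b3', in which case they should be coded as 'b1a' and 'b2a' respectively.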
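The three rules above can be sketched as a small base R helper. This is only one reading of the rules, not a definitive implementation. The function name `recode_sequence` is hypothetical, and the function assumes it receives the codes for a single species in a single segment, already sorted by date. It also assumes that a 'b1' counts as late ('b1b') as soon as any 'b2' or 'b3' has been seen earlier in the sequence. The rules do not fully settle the case of a 'b1' that follows a 'b2' when no 'b3' ever occurs, so adjust that condition if rule 3 should take precedence.

```r
# Hypothetical helper: recode one date-ordered sequence of codes
recode_sequence <- function(code) {
  # TRUE from the first "b3" onwards
  seen_b3 <- cumsum(code == "b3") > 0
  # TRUE from the first "b2" or "b3" onwards
  seen_b2_or_b3 <- cumsum(code %in% c("b2", "b3")) > 0

  out <- code
  out[code == "b1" & !seen_b2_or_b3] <- "b1a"  # early single flower
  out[code == "b1" & seen_b2_or_b3]  <- "b1b"  # late single flower
  out[code == "b2" & !seen_b3]       <- "b2a"  # sparse flowers before the peak
  out[code == "b2" & seen_b3]        <- "b2b"  # sparse flowers after the peak
  out
}

# The 12-visit example from the question
obs <- c("b1", "b1", "b2", "b2", "b2", "b3", "b3", "b2", "b2", "b1", "b4", "b4")
recoded <- recode_sequence(obs)
recoded

# A sequence with no "b3": both codes receive the "a" suffix
no_peak <- recode_sequence(c("b1", "b2", "b4"))
no_peak
```

Because the helper works on a plain character vector, it can be applied per group with whatever grouping tool is already in use (for example `ave()`, data.table's `by`, or dplyr's `group_by()`).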

Here is what the data look like:

Date    Segment Species Code
01-Jun-17   1   A   b1
06-Jun-17   1   A   b1
10-Jun-17   1   A   b2
14-Jun-17   1   A   b2
19-Jun-17   1   A   b2
23-Jun-17   1   A   b3
28-Jun-17   1   A   b3
03-Jul-17   1   A   b2
08-Jul-17   1   A   b2
14-Jul-17   1   A   b1
19-Jul-17   1   A   b4
23-Jul-17   1   A   b4

Here is what it should look like:

Date    Segment Species Code
01-Jun-17   1   A   b1a
06-Jun-17   1   A   b1a
10-Jun-17   1   A   b2a
14-Jun-17   1   A   b2a
19-Jun-17   1   A   b2a
23-Jun-17   1   A   b3
28-Jun-17   1   A   b3
03-Jul-17   1   A   b2b
08-Jul-17   1   A   b2b
14-Jul-17   1   A   b1b
19-Jul-17   1   A   b4
23-Jul-17   1   A   b4

Here is the sample data:

Test.Data<- structure(list(Date = structure(c(17318, 17323, 17327, 17331, 
17336, 17340, 17345, 17350, 17355, 17361, 17366, 17318, 17323, 
17327, 17331, 17336, 17340, 17345, 17350, 17355, 17361, 17366, 
17370, 17375, 17318, 17323, 17327, 17331, 17336, 17340, 17345, 
17350, 17355, 17361, 17366, 17318, 17323, 17327, 17331, 17336, 
17340, 17345, 17350, 17355, 17361, 17366, 17370, 17375, 17355, 
17361, 17366, 17370, 17375, 17350, 17355, 17361, 17366, 17370
), class = "Date"), Segment = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 1, 1, 1, 1, 1), Species = c("A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C"
), Code = c("b1", "b1", "b2", "b2", "b2", "b3", "b3", "b2", "b2", 
"b4", "b4", "b1", "b2", "b2", "b2", "b3", "b3", "b3", "b2", "b2", 
"b2", "b1", "b4", "b4", "b1", "b1", "b2", "b2", "b2", "b3", "b3", 
"b2", "b2", "b4", "b4", "b1", "b2", "b2", "b2", "b3", "b3", "b3", 
"b2", "b2", "b2", "b4", "b4", "b4", "b3", "b3", "b2", "b1", "b4", 
"b1", "b1", "b2", "b2", "b4")), .Names = c("Date", "Segment", 
"Species", "Code"), row.names = c(NA, -58L), class = "data.frame")

2 Answers

  • 4

    Using data.table:

    library(data.table)
    setDT(Test.Data)
    Test.Data[, temp := rleid(Code), by = .(Segment, Species)] #unique ids for the sequence of codes
    Test.Data[Code == "b2", Code := paste0(Code, letters[rleid(temp)]), 
      by = .(Segment, Species)] #use the unique ids inside subset
    Test.Data[, temp := NULL]
    #          Date Segment Species Code
    # 1: 2017-06-01       1       A   b1
    # 2: 2017-06-06       1       A   b1
    # 3: 2017-06-10       1       A  b2a
    # 4: 2017-06-14       1       A  b2a
    # 5: 2017-06-19       1       A  b2a
    # 6: 2017-06-23       1       A   b3
    # 7: 2017-06-28       1       A   b3
    # 8: 2017-07-03       1       A  b2b
    # 9: 2017-07-08       1       A  b2b
    #10: 2017-07-14       1       A   b4
    #11: 2017-07-19       1       A   b4
    #12: 2017-06-01       1       B   b1
    #13: 2017-06-06       1       B  b2a
    #14: 2017-06-10       1       B  b2a
    #15: 2017-06-14       1       B  b2a
    #16: 2017-06-19       1       B   b3
    #17: 2017-06-23       1       B   b3
    #18: 2017-06-28       1       B   b3
    #19: 2017-07-03       1       B  b2b
    #20: 2017-07-08       1       B  b2b
    #21: 2017-07-14       1       B  b2b
    # ... (output truncated)
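The answer above letters the 'b2' runs in the order they occur and leaves 'b1' untouched, so a group whose first 'b2' comes after 'b3' would be labelled 'b2a'. As a hedged alternative, the sketch below derives the suffix from whether 'b3' (or 'b2'/'b3' for 'b1') has already been seen in the group. The toy table `dt` and the helper columns `seen_b3` and `seen_b23` are made up for illustration, and rows are assumed to be sorted by date within each Segment and Species.

```r
library(data.table)

# Toy data: species C is first observed at peak flowering (no early b1/b2)
dt <- data.table(
  Segment = 1,
  Species = rep(c("A", "C"), times = c(6, 4)),
  Code    = c("b1", "b2", "b3", "b2", "b1", "b4",
              "b3", "b2", "b1", "b4")
)

# Flags computed per group: has b3 (or b2/b3) occurred at or before this row?
dt[, seen_b3  := cumsum(Code == "b3") > 0, by = .(Segment, Species)]
dt[, seen_b23 := cumsum(Code %in% c("b2", "b3")) > 0, by = .(Segment, Species)]

# Suffix "a" before the flag turns TRUE, "b" afterwards
dt[Code == "b2", Code := paste0("b2", ifelse(seen_b3,  "b", "a"))]
dt[Code == "b1", Code := paste0("b1", ifelse(seen_b23, "b", "a"))]

# Drop the helper columns
dt[, c("seen_b3", "seen_b23") := NULL]
dt
```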
    
  • 2

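    You can use dplyr:

    library(dplyr)
    Test.Data %>% 
      group_by(Species) %>%   # groups by species only; use group_by(Segment, Species) to treat each segment separately
      mutate(hadb3 = cumsum(Code=="b3")>0) %>%   # TRUE from the first b3 onwards
      mutate(Code = ifelse(Code=="b2" & !hadb3,"b2a",Code)) %>%   # b2 before any b3
      mutate(Code = ifelse(Code=="b2" & hadb3,"b2b",Code))   # b2 at or after the first b3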
    

    Result:

    # A tibble: 48 x 5
    # Groups:   Species [2]
             Date Segment Species  Code hadb3
           <date>   <dbl>   <chr> <chr> <lgl>
     1 2017-06-01       1       A    b1 FALSE
     2 2017-06-06       1       A    b1 FALSE
     3 2017-06-10       1       A   b2a FALSE
     4 2017-06-14       1       A   b2a FALSE
     5 2017-06-19       1       A   b2a FALSE
     6 2017-06-23       1       A    b3  TRUE
     7 2017-06-28       1       A    b3  TRUE
     8 2017-07-03       1       A   b2b  TRUE
     9 2017-07-08       1       A   b2b  TRUE
    10 2017-07-14       1       A    b4  TRUE
    # ... with 38 more rows
    

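    mutate(hadb3 = cumsum(Code=="b3")>0) creates a logical column that records whether b3 has already occurred at or before the current row, which is all the ifelse statements need to produce the result.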
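The same cumulative-flag idea can be extended to 'b1' and applied per segment as well as per species. The following is a minimal sketch under stated assumptions, not the answerer's code: the toy data frame `obs` and the helper columns `seen_b3` and `seen_b23` are invented for illustration, and rows are assumed to be in date order within each group.

```r
library(dplyr)

# Toy data: segment 1 reaches peak flowering (b3), segment 2 never does
obs <- data.frame(
  Segment = c(1, 1, 1, 1, 2, 2, 2),
  Species = "A",
  Code    = c("b1", "b2", "b3", "b2", "b1", "b2", "b4"),
  stringsAsFactors = FALSE
)

recoded <- obs %>%
  group_by(Segment, Species) %>%   # one sequence per segment and species
  mutate(
    seen_b3  = cumsum(Code == "b3") > 0,                # b3 seen so far?
    seen_b23 = cumsum(Code %in% c("b2", "b3")) > 0,     # b2 or b3 seen so far?
    Code = case_when(
      Code == "b1" & !seen_b23 ~ "b1a",
      Code == "b1"             ~ "b1b",
      Code == "b2" & !seen_b3  ~ "b2a",
      Code == "b2"             ~ "b2b",
      TRUE                     ~ Code   # b3 and b4 stay unchanged
    )
  ) %>%
  ungroup() %>%
  select(-seen_b3, -seen_b23)

recoded
```

Grouping by both `Segment` and `Species` keeps a 'b3' recorded in one segment from changing how 'b2' is labelled in another.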
