
How to use a loop to conditionally create values in a new variable


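Every five days I collect data on plant development, or phenology (encoded with the categorical variable 'Code'), along a transect divided into 78 consecutive segments. Each species is surveyed in every segment of the transect. This effort repeats a study carried out 100 years ago!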

I would like to recode my dataset to overcome the shortcomings of the original study's coding system.

The original coding system (for plant flowering phases):

K = flower bud
b1 = single flower
b2 = sparse flowers (two or three)
b3 = flowers common (more than three)
b4 = flowering ended

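The problem is that, when I want to analyse my data, these codes do not sufficiently describe the context of an observation. For example, the codes 'b1' and 'b2' can occur both early and late in the flowering period. This makes it difficult to "rank" my observations in a standardized way.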

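A solution could be a loop, or some other efficient approach, that moves sequentially through the observations (by 'Segment', 'Species', 'Date') and recodes each one based on whether it occurred before or after a specific event (in this case, the first time 'Code' was recorded as 'b3').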

For any given segment of the transect and species, the codes in the original data may look like this:

Date    Segment Species Code
26/05/2017  1   A   K
01/06/2017  1   A   b1
06/06/2017  1   A   b1
10/06/2017  1   A   b2
14/06/2017  1   A   b2
19/06/2017  1   A   b2
23/06/2017  1   A   b3
28/06/2017  1   A   b3
03/07/2017  1   A   b2
08/07/2017  1   A   b2
14/07/2017  1   A   b1
19/07/2017  1   A   b4

Had I considered this before the season, I would have used a coding system like the following:

K = flower bud
b1a = single flower
b2a = sparse flowers (two or three)
b3 = flowers common (more than three)
b2b = sparse flowers (two or three)
b1b = single flower
b4 = flowering ended

With these changes to the codes, the example data above would look like this:

Date    Segment Species Code
26/05/2017  1   A   K
01/06/2017  1   A   b1a
06/06/2017  1   A   b1a
10/06/2017  1   A   b2a
14/06/2017  1   A   b2a
19/06/2017  1   A   b2a
23/06/2017  1   A   b3
28/06/2017  1   A   b3
03/07/2017  1   A   b2b
08/07/2017  1   A   b2b
14/07/2017  1   A   b1b
19/07/2017  1   A   b4

In addition, I have to recode the historical dataset as well, so any solution needs to work for both.

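Note: it is very important that the switch from appending "a" to appending "b" to 'b1' or 'b2' happens after the first time 'b3' is encountered. This matters because flower abundance sometimes fluctuates during the growing season. For example: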

Date    Segment Species Code
01-Jun-17   1   A   b1
06-Jun-17   1   A   b1
10-Jun-17   1   A   b2
14-Jun-17   1   A   b2
19-Jun-17   1   A   b3
23-Jun-17   1   A   b3
28-Jun-17   1   A   b2 # appears out of the "ideal" sequence
02-Aug-17   1   A   b3
07-Aug-17   1   A   b2 # appears out of the "ideal" sequence
12-Aug-17   1   A   b3
17-Aug-17   1   A   b2
22-Aug-17   1   A   b1 # appears out of the "ideal" sequence
27-Aug-17   1   A   b2 
02-Sep-17   1   A   b1
07-Sep-17   1   A   b4

In this case, the data should look like:

Date    Segment Species Code
01-Jun-17   1   A   b1a
06-Jun-17   1   A   b1a
10-Jun-17   1   A   b2a
14-Jun-17   1   A   b2a
19-Jun-17   1   A   b3
23-Jun-17   1   A   b3
28-Jun-17   1   A   b2b
02-Aug-17   1   A   b3
07-Aug-17   1   A   b2b
12-Aug-17   1   A   b3
17-Aug-17   1   A   b2b
22-Aug-17   1   A   b1b
27-Aug-17   1   A   b2b 
02-Sep-17   1   A   b1b
07-Sep-17   1   A   b4

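One last point. Because the growing season in the Arctic is very short, not every flowering phase (= code) occurs for every species in a given segment.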

Example data:

DT <- structure(list(Date = structure(c(17312, 17318, 17323, 17327, 
17331, 17336, 17340, 17345, 17350, 17355, 17361, 17366, 17312, 
17318, 17323, 17327, 17331, 17336, 17340, 17345, 17350, 17355, 
17361, 17366, 17370, 17375, 17350, 17355, 17361, 17366, 17370, 
17312, 17318, 17323, 17327, 17331, 17336, 17340, 17345, 17350, 
17355, 17361, 17366, 17312, 17318, 17323, 17327, 17331, 17336, 
17340, 17345, 17350, 17355, 17361, 17366, 17355, 17361, 17366, 
17370, 17375, 17318, 17323, 17327, 17331, 17336, 17340, 17345, 
17380, 17385, 17390, 17395, 17400, 17405, 17411, 17416, 17318, 
17323, 17327, 17331, 17336, 17340, 17345, 17380, 17385, 17390, 
17395, 17400, 17405, 17411, 17416, 17318, 17323, 17327, 17331, 
17336, 17340, 17345, 17380, 17385, 17390, 17395, 17400, 17405, 
17411, 17416), class = "Date"), Segment = c(1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4), Species = c("A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C", 
"C", "C", "C", "C", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "C", "C", "C", "C", "C", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"
), Code = c("K", "b1", "b1", "b2", "b2", "b2", "b3", "b3", "b2", 
"b2", "b1", "b4", "b1", "b1", "b2", "b2", "b2", "b3", "b3", "b3", 
"b2", "b2", "b2", "b1", "b1", "b4", "b1", "b1", "b2", "b2", "b4", 
"b1", "b1", "b2", "b2", "b2", "b3", "b3", "b3", "b2", "b2", "b2", 
"b4", "K", "b1", "b1", "b2", "b2", "b2", "b3", "b3", "b2", "b2", 
"b2", "b4", "b3", "b3", "b2", "b1", "b4", "b1", "b1", "b2", "b2", 
"b3", "b3", "b2", "b3", "b2", "b3", "b2", "b1", "b2", "b1", "b4", 
"b1", "b1", "b2", "b2", "b3", "b3", "b2", "b3", "b2", "b3", "b2", 
"b1", "b2", "b1", "b4", "b1", "b1", "b2", "b2", "b3", "b3", "b2", 
"b3", "b2", "b3", "b2", "b1", "b2", "b1", "b4")), .Names = c("Date", 
"Segment", "Species", "Code"), row.names = c(NA, -105L), class = c("data.table", 
"data.frame"))

2 Answers

  • 0

    With dplyr, this can be done as follows:

    library(dplyr)
    DT %>% 
      group_by(Species, Segment) %>% 
      mutate(after_b3 = (cumsum(Code == "b3") > 0), 
             Code_new = case_when(Code %in% c("b1", "b2") & !after_b3 ~ paste0(Code, "a"), 
                                  Code %in% c("b1", "b2") & after_b3 ~ paste0(Code, "b"), 
                                  TRUE ~ Code)) 
    
    # A tibble: 105 x 6
    # Groups:   Segment, Species [9]
    #          Date Segment Species  Code after_b3 Code_new
    #        <date>   <dbl>   <chr> <chr>    <lgl>    <chr>
    #  1 2017-05-26       1       A     K    FALSE        K
    #  2 2017-06-01       1       A    b1    FALSE      b1a
    #  3 2017-06-06       1       A    b1    FALSE      b1a
    #  4 2017-06-10       1       A    b2    FALSE      b2a
    #  5 2017-06-14       1       A    b2    FALSE      b2a
    #  6 2017-06-19       1       A    b2    FALSE      b2a
    #  7 2017-06-23       1       A    b3     TRUE       b3
    #  8 2017-06-28       1       A    b3     TRUE       b3
    #  9 2017-07-03       1       A    b2     TRUE      b2b
    # 10 2017-07-08       1       A    b2     TRUE      b2b
    # ... with 95 more rows
    

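    With group_by, the code is applied to each Segment/Species combination. The after_b3 column records whether Code has already reached "b3" within the group. Code_new is then determined by checking a few cases. Note that this relies on the rows being in date order within each group.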
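
To make the cumsum() flag in the dplyr answer concrete, here is a minimal base R sketch on a single group. The vector `code` is a hypothetical toy input that copies the fluctuating sequence from the question; the names `suffix` and `code_new` are introduced only for this illustration.

```r
# Hypothetical toy vector: the fluctuating sequence from the question
# (one Segment/Species group, already in date order)
code <- c("b1", "b1", "b2", "b2", "b3", "b3", "b2", "b3",
          "b2", "b3", "b2", "b1", "b2", "b1", "b4")

# cumsum() counts how many "b3" have been seen so far;
# > 0 therefore means "the first b3 has been reached"
after_b3 <- cumsum(code == "b3") > 0

# Append "a" before the first b3 and "b" from then on, only for b1/b2
suffix   <- ifelse(after_b3, "b", "a")
code_new <- ifelse(code %in% c("b1", "b2"), paste0(code, suffix), code)

print(data.frame(code, after_b3, code_new))
```

Because the flag never returns to FALSE once the first "b3" has been seen, a later "b2" that appears between two "b3" records is still recoded as "b2b", which is the behaviour the question asks for.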

  • 2

    Perhaps not the most efficient way, but it works (assuming I understood your question correctly):

    library(data.table)
    DT <- as.data.table(DT)
    
    tmp_list <- list()
    for (seg in unique(DT$Segment)){ # seg <- 1
      for(spec in unique(DT$Species)){ # spec <- "C"
        tmp_list[[paste0(seg,"_",spec)]] <- DT[Segment%in%seg & Species%in%spec]
        index <- which(tmp_list[[paste0(seg,"_",spec)]]$Code=="b3")[1]
        rows <- nrow(tmp_list[[paste0(seg,"_",spec)]])
        if(!is.na(index)){
          tmp_list[[paste0(seg,"_",spec)]][index:rows,new_code:=ifelse(Code%in%"b1","b1b",
                                                                       ifelse(Code%in%"b2","b2b",Code))]
          tmp_list[[paste0(seg,"_",spec)]][1:index,new_code:=ifelse(Code%in%"b1","b1a",
                                                                    ifelse(Code%in%"b2","b2a",Code))]
        }else{
          tmp_list[[paste0(seg,"_",spec)]][,new_code:=ifelse(Code%in%"b1","b1a",
                                                      ifelse(Code%in%"b2","b2a",Code))]
        }   
      }
    }
    final <- rbindlist(tmp_list)
    

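    So, for each segment and species, I find the first b3. After it (and by after I mean for the following rows) I change b1 and b2 to b1b and b2b respectively. For the rows before the first b3, I change b1 and b2 to b1a and b2a respectively. The if statement handles the case where a particular segment/species combination has no b3.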
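
The same per-group logic can probably be written without explicit loops by using data.table's `by` argument. The following is only a sketch under stated assumptions: `toy` is a hypothetical miniature dataset (not the question's DT), and `after_b3` and `new_code` are illustrative column names. It needs the data.table package.

```r
library(data.table)

# Hypothetical toy data: two Segment/Species groups,
# the second of which never reaches "b3"
toy <- data.table(
  Date    = as.Date("2017-06-01") + c(0, 5, 10, 15, 20, 0, 5, 10),
  Segment = c(1, 1, 1, 1, 1, 2, 2, 2),
  Species = c("A", "A", "A", "A", "A", "C", "C", "C"),
  Code    = c("b1", "b2", "b3", "b2", "b4", "b1", "b2", "b4")
)

# Sort so that "first b3" means first in time within each group
setorder(toy, Segment, Species, Date)

# Flag rows at or after the first "b3" per group;
# groups without any b3 stay FALSE throughout
toy[, after_b3 := cumsum(Code == "b3") > 0, by = .(Segment, Species)]

# Recode only b1/b2; every other code is carried over unchanged
toy[, new_code := Code]
toy[Code %in% c("b1", "b2"),
    new_code := paste0(Code, ifelse(after_b3, "b", "a"))]

print(toy)
```

In this sketch, groups with no "b3" need no special branch: their flag is FALSE on every row, so b1 and b2 simply receive the "a" suffix.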
