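Every five days I collect data on plant development, or phenology (encoded with the categorical variable 'Code'), along a transect divided into 78 consecutive segments. Every species is surveyed in every segment of the transect. This effort repeats a study carried out 100 years ago!
I want to recode my dataset to overcome shortcomings of the original study's coding system.
The original coding system (for the flowering stage of plants):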
K = flower bud
b1 = single flower
b2 = sparse flowers (two or three)
b3 = flowers common (more than three)
b4 = flowering ended
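The problem is that when I come to analyse my data, these codes do not describe the context of an observation well enough. For example, the codes 'b1' and 'b2' can occur both early and late in the flowering period. This makes it difficult to 'rank' my observations in a standardized way.
A solution could be a loop, or some other efficient method, that steps through the observations in order (by 'Segment', 'Species', 'Date') and recodes each one according to whether it occurs before or after a particular event (in this case, the first time 'Code' is recorded as 'b3').
For any given segment of the transect and species, the codes in the raw data might look like this: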
Date Segment Species Code
26/05/2017 1 A K
01/06/2017 1 A b1
06/06/2017 1 A b1
10/06/2017 1 A b2
14/06/2017 1 A b2
19/06/2017 1 A b2
23/06/2017 1 A b3
28/06/2017 1 A b3
03/07/2017 1 A b2
08/07/2017 1 A b2
14/07/2017 1 A b1
19/07/2017 1 A b4
If I think about the data in terms of position within the season, I would use a coding system like this:
K = flower bud
b1a = single flower
b2a = sparse flowers (two or three)
b3 = flowers common (more than three)
b2b = sparse flowers (two or three)
b1b = single flower
b4 = flowering ended
With these changes to the codes, the example data above would look like this:
Date Segment Species Code
26/05/2017 1 A K
01/06/2017 1 A b1a
06/06/2017 1 A b1a
10/06/2017 1 A b2a
14/06/2017 1 A b2a
19/06/2017 1 A b2a
23/06/2017 1 A b3
28/06/2017 1 A b3
03/07/2017 1 A b2b
08/07/2017 1 A b2b
14/07/2017 1 A b1b
19/07/2017 1 A b4
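In addition, I must recode the historical dataset as well, so any solution needs to work for both.
Note: it is very important that the switch from appending 'a' to appending 'b' to 'b1' or 'b2' happens after the first occurrence of 'b3'. This matters because the abundance of flowers sometimes fluctuates during the growing season. For example: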
Date Segment Species Code
01-Jun-17 1 A b1
06-Jun-17 1 A b1
10-Jun-17 1 A b2
14-Jun-17 1 A b2
19-Jun-17 1 A b3
23-Jun-17 1 A b3
28-Jun-17 1 A b2 # appears out of the "ideal" sequence
02-Aug-17 1 A b3
07-Aug-17 1 A b2 # appears out of the "ideal" sequence
12-Aug-17 1 A b3
17-Aug-17 1 A b2
22-Aug-17 1 A b1 # appears out of the "ideal" sequence
27-Aug-17 1 A b2
02-Sep-17 1 A b1
07-Sep-17 1 A b4
In this case, the data would look like:
Date Segment Species Code
01-Jun-17 1 A b1a
06-Jun-17 1 A b1a
10-Jun-17 1 A b2a
14-Jun-17 1 A b2a
19-Jun-17 1 A b3
23-Jun-17 1 A b3
28-Jun-17 1 A b2b
02-Aug-17 1 A b3
07-Aug-17 1 A b2b
12-Aug-17 1 A b3
17-Aug-17 1 A b2b
22-Aug-17 1 A b1b
27-Aug-17 1 A b2b
02-Sep-17 1 A b1b
07-Sep-17 1 A b4
One last point. Because the growing season in the Arctic is short, not every flowering stage (= code) occurs for every species in a segment.
Example data:
DT <- structure(list(Date = structure(c(17312, 17318, 17323, 17327,
17331, 17336, 17340, 17345, 17350, 17355, 17361, 17366, 17312,
17318, 17323, 17327, 17331, 17336, 17340, 17345, 17350, 17355,
17361, 17366, 17370, 17375, 17350, 17355, 17361, 17366, 17370,
17312, 17318, 17323, 17327, 17331, 17336, 17340, 17345, 17350,
17355, 17361, 17366, 17312, 17318, 17323, 17327, 17331, 17336,
17340, 17345, 17350, 17355, 17361, 17366, 17355, 17361, 17366,
17370, 17375, 17318, 17323, 17327, 17331, 17336, 17340, 17345,
17380, 17385, 17390, 17395, 17400, 17405, 17411, 17416, 17318,
17323, 17327, 17331, 17336, 17340, 17345, 17380, 17385, 17390,
17395, 17400, 17405, 17411, 17416, 17318, 17323, 17327, 17331,
17336, 17340, 17345, 17380, 17385, 17390, 17395, 17400, 17405,
17411, 17416), class = "Date"), Segment = c(1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4), Species = c("A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C",
"C", "C", "C", "C", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "C", "C", "C", "C", "C", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"
), Code = c("K", "b1", "b1", "b2", "b2", "b2", "b3", "b3", "b2",
"b2", "b1", "b4", "b1", "b1", "b2", "b2", "b2", "b3", "b3", "b3",
"b2", "b2", "b2", "b1", "b1", "b4", "b1", "b1", "b2", "b2", "b4",
"b1", "b1", "b2", "b2", "b2", "b3", "b3", "b3", "b2", "b2", "b2",
"b4", "K", "b1", "b1", "b2", "b2", "b2", "b3", "b3", "b2", "b2",
"b2", "b4", "b3", "b3", "b2", "b1", "b4", "b1", "b1", "b2", "b2",
"b3", "b3", "b2", "b3", "b2", "b3", "b2", "b1", "b2", "b1", "b4",
"b1", "b1", "b2", "b2", "b3", "b3", "b2", "b3", "b2", "b3", "b2",
"b1", "b2", "b1", "b4", "b1", "b1", "b2", "b2", "b3", "b3", "b2",
"b3", "b2", "b3", "b2", "b1", "b2", "b1", "b4")), .Names = c("Date",
"Segment", "Species", "Code"), row.names = c(NA, -105L), class = c("data.table",
"data.frame"))
2 Answers
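Using dplyr, this can be done as follows. With group_by, the code is applied to each Segment, Species combination. The after_b3 column records whether Code has already been "b3" within the group. Code_new is then determined by checking a few cases.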
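A minimal sketch of the approach described above, assuming dplyr is installed. The names group_by, after_b3 and Code_new come from the description; building after_b3 with cumsum(), the case_when() rules, and the small obs data frame with made-up dates are assumptions for illustration only. Under these rules a group that never reaches "b3" gets the "a" suffix throughout.

```r
library(dplyr)

# Illustrative data: one Segment/Species combination in which abundance
# fluctuates after the first "b3". Dates are hypothetical (5-day spacing).
obs <- data.frame(
  Date    = as.Date("2017-06-01") + 5 * (0:7),
  Segment = 1,
  Species = "A",
  Code    = c("b1", "b2", "b3", "b2", "b3", "b2", "b1", "b4"),
  stringsAsFactors = FALSE
)

recoded <- obs %>%
  arrange(Segment, Species, Date) %>%   # rows must be in date order per group
  group_by(Segment, Species) %>%
  mutate(
    # TRUE from the first "b3" onwards within the group
    after_b3 = cumsum(Code == "b3") > 0,
    Code_new = case_when(
      Code %in% c("b1", "b2") & !after_b3 ~ paste0(Code, "a"),
      Code %in% c("b1", "b2") &  after_b3 ~ paste0(Code, "b"),
      TRUE                                ~ Code   # K, b3, b4 stay as they are
    )
  ) %>%
  ungroup()

print(recoded$Code_new)
```

Because after_b3 never switches back to FALSE, a later "b2" that follows a second "b3" is still recoded as "b2b", which is the behaviour the question asks for.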
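Maybe not the most efficient way, but it works (assuming I understood your question).
So, by segment and species, I find the first b3; after it (and by "after" I mean in the following rows) I change b1 and b2 to b1b and b2b respectively. For the rows before the first b3, I change b1 and b2 to b1a and b2a respectively. The if statement handles the case where a particular species-segment combination has no b3.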
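The steps described above could be sketched in base R roughly as follows. The helper name recode_group and the obs data frame (with made-up dates) are hypothetical. What happens to a group without any b3 is also an assumption: here the codes are simply left unchanged, which may differ from what the answer's if statement does.

```r
# Recode the codes of ONE Segment/Species group.
# `codes` must already be sorted by date.
recode_group <- function(codes) {
  first_b3 <- match("b3", codes)   # position of the first "b3", NA if absent
  if (is.na(first_b3)) {
    # Assumption: with no "b3" in the group, leave the codes unchanged
    return(codes)
  }
  idx    <- seq_along(codes)
  is_b12 <- codes %in% c("b1", "b2")
  before <- is_b12 & idx < first_b3
  after  <- is_b12 & idx > first_b3
  codes[before] <- paste0(codes[before], "a")
  codes[after]  <- paste0(codes[after], "b")
  codes
}

# Illustrative data: species A reaches "b3", species C never does.
obs <- data.frame(
  Date    = c(as.Date("2017-05-26") + 5 * (0:5),
              as.Date("2017-07-09") + 5 * (0:3)),
  Segment = 1,
  Species = c(rep("A", 6), rep("C", 4)),
  Code    = c("K", "b1", "b2", "b3", "b2", "b4",
              "b1", "b2", "b2", "b4"),
  stringsAsFactors = FALSE
)

# Sort, then apply the helper within each Segment/Species group
obs <- obs[order(obs$Segment, obs$Species, obs$Date), ]
obs$Code_new <- ave(obs$Code, obs$Segment, obs$Species, FUN = recode_group)

print(obs[, c("Species", "Code", "Code_new")])
```

The same ave() call should work on the question's data once its rows are ordered by Segment, Species and Date.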