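I want to match two data frames based on the values in a column that exists in both (the 'gridcell' column). This would be a simple task if it did not have to be done separately for each unique date in the data frames.
Here is some sample data: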
> dput(df1)
structure(list(index = 1:7, date = c("13/04/2011", "13/04/2011",
"04/04/2011", "04/04/2011", "04/04/2011", "28/03/2011", "28/03/2011"
), yrday = c(103L, 103L, 94L, 94L, 94L, 87L, 87L), gridcell = c(6L,
9L, 2L, 5L, 8L, 3L, 4L), dist = c(178L, 158L, 137L, 116L, 95L,
135L, 115L), ang = c(148, 147.6, 163.6, 159.7, 152.5, 152.2,
121.9)), .Names = c("index", "date", "yrday", "gridcell", "dist",
"ang"), class = "data.frame", row.names = c(NA, -7L))
> dput(df2)
structure(list(date = c("28/03/2011", "28/03/2011", "28/03/2011",
"28/03/2011", "28/03/2011", "28/03/2011", "28/03/2011", "28/03/2011",
"28/03/2011", "29/03/2011", "29/03/2011", "29/03/2011", "29/03/2011",
"29/03/2011", "29/03/2011", "29/03/2011", "29/03/2011", "29/03/2011",
"04/04/2011", "04/04/2011", "04/04/2011", "04/04/2011", "04/04/2011",
"04/04/2011", "04/04/2011", "04/04/2011", "04/04/2011", "13/04/2011",
"13/04/2011", "13/04/2011", "13/04/2011", "13/04/2011", "13/04/2011",
"13/04/2011", "13/04/2011", "13/04/2011"), yrday = c(87L, 87L,
87L, 87L, 87L, 87L, 87L, 87L, 87L, 88L, 88L, 88L, 88L, 88L, 88L,
88L, 88L, 88L, 94L, 94L, 94L, 94L, 94L, 94L, 94L, 94L, 94L, 103L,
103L, 103L, 103L, 103L, 103L, 103L, 103L, 103L), gridcell = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L), r = c(161L, 162L, 162L, 164L, 167L, 168L, 169L,
170L, 170L, 171L, 170L, 169L, 168L, 158L, 160L, 162L, 164L, 165L,
263L, 258L, 255L, 250L, 246L, 242L, 239L, 238L, 228L, 235L, 234L,
231L, 230L, 229L, 228L, 227L, 243L, 242L)), .Names = c("date",
"yrday", "gridcell", "r"), class = "data.frame", row.names = c(NA,
-36L))
> head(df1)
  index       date yrday gridcell dist   ang
1     1 13/04/2011   103        6  178 148.0
2     2 13/04/2011   103        9  158 147.6
3     3 04/04/2011    94        2  137 163.6
4     4 04/04/2011    94        5  116 159.7
5     5 04/04/2011    94        8   95 152.5
> head(df2)
        date yrday gridcell   r
1 28/03/2011    87        1 161
2 28/03/2011    87        2 162
3 28/03/2011    87        3 162
4 28/03/2011    87        4 164
5 28/03/2011    87        5 167
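I would like to end up with a new version of df1 that includes the matching rows from df2, matched on identical 'gridcell' values within each date (as shown below):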
  index     date_1 yrday_1 gridcell_1 dist   ang yrday_2 gridcell_2   r
1     1 13/04/2011     103          6  178 148.0     103          6 228
2     2 13/04/2011     103          9  158 147.6     103          9 242
3     3 04/04/2011      94          2  137 163.6      94          2 258
4     4 04/04/2011      94          5  116 159.7      94          5 246
5     5 04/04/2011      94          8   95 152.5      94          8 238
6     6 28/03/2011      87          3  135 152.2      87          3 162
7     7 28/03/2011      87          4  115 121.9      87          4 164
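So far I have tried merging the two data frames by the 'date' column. This gives a new data frame in which each row of df1 is repeated once for every df2 row with a matching date (i.e. every possible 'gridcell' value from df2 is paired with each unique df1 row).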
df1$date = as.Date(df1$date, format="%d/%m/%Y")
df2$date = as.Date(df2$date, format="%d/%m/%Y")
nw.df = merge(df1,df2, by="date")
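I am sure the 'index' column, which has unique values, could be used in a function such as ddply to condense the new data frame, keeping for each unique 'index' value only the row whose 'gridcell' columns match (i.e. ddply(nw.df, .(index, …), summarise, …)). I just can't figure out how to do it! Any suggestions or help would be much appreciated. Thanks.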
1 Answer
You can specify both 'date' and 'gridcell' as the columns to merge on:
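A minimal sketch of such a merge, assuming the df1 and df2 from the question (rebuilt compactly here so the snippet runs on its own; the name merged is only illustrative):

```r
# Sample data from the question, rebuilt compactly
df1 <- data.frame(
  index    = 1:7,
  date     = c("13/04/2011", "13/04/2011", "04/04/2011", "04/04/2011",
               "04/04/2011", "28/03/2011", "28/03/2011"),
  yrday    = c(103L, 103L, 94L, 94L, 94L, 87L, 87L),
  gridcell = c(6L, 9L, 2L, 5L, 8L, 3L, 4L),
  dist     = c(178L, 158L, 137L, 116L, 95L, 135L, 115L),
  ang      = c(148, 147.6, 163.6, 159.7, 152.5, 152.2, 121.9),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  date     = rep(c("28/03/2011", "29/03/2011", "04/04/2011", "13/04/2011"),
                 each = 9),
  yrday    = rep(c(87L, 88L, 94L, 103L), each = 9),
  gridcell = rep(1:9, times = 4),
  r        = as.integer(c(161, 162, 162, 164, 167, 168, 169, 170, 170,
                          171, 170, 169, 168, 158, 160, 162, 164, 165,
                          263, 258, 255, 250, 246, 242, 239, 238, 228,
                          235, 234, 231, 230, 229, 228, 227, 243, 242)),
  stringsAsFactors = FALSE
)

# Match on date AND gridcell, so each df1 row pairs with exactly one df2 row
merged <- merge(df1, df2, by = c("date", "gridcell"))

# merge() sorts by the key columns; restore the original df1 order
merged <- merged[order(merged$index), ]
print(merged)
```

Because 'yrday' is present in both inputs but is not a key column, it comes back twice, as yrday.x and yrday.y.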
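The drawback of this code is that the 'yrday' column is duplicated (it appears as yrday.x and yrday.y). You may therefore want to subset 'df2' so that it contains only the columns used for merging plus the column you want to add (here 'r'):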
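A sketch of that subset-then-merge step, again assuming the question's df1 and df2 (rebuilt here so the snippet is runnable by itself; merged_r is an illustrative name):

```r
# Sample data from the question, rebuilt compactly
df1 <- data.frame(
  index    = 1:7,
  date     = c("13/04/2011", "13/04/2011", "04/04/2011", "04/04/2011",
               "04/04/2011", "28/03/2011", "28/03/2011"),
  yrday    = c(103L, 103L, 94L, 94L, 94L, 87L, 87L),
  gridcell = c(6L, 9L, 2L, 5L, 8L, 3L, 4L),
  dist     = c(178L, 158L, 137L, 116L, 95L, 135L, 115L),
  ang      = c(148, 147.6, 163.6, 159.7, 152.5, 152.2, 121.9),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  date     = rep(c("28/03/2011", "29/03/2011", "04/04/2011", "13/04/2011"),
                 each = 9),
  yrday    = rep(c(87L, 88L, 94L, 103L), each = 9),
  gridcell = rep(1:9, times = 4),
  r        = as.integer(c(161, 162, 162, 164, 167, 168, 169, 170, 170,
                          171, 170, 169, 168, 158, 160, 162, 164, 165,
                          263, 258, 255, 250, 246, 242, 239, 238, 228,
                          235, 234, 231, 230, 229, 228, 227, 243, 242)),
  stringsAsFactors = FALSE
)

# Keep only the key columns and 'r' from df2, so 'yrday' is not duplicated
merged_r <- merge(df1, df2[, c("date", "gridcell", "r")])
print(names(merged_r))
```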
Note that we do not need to specify the by columns here. If by is not given, merge finds the columns to merge on with by = intersect(names(x), names(y)) (see ?merge), which here is: intersect(names(df1), names(df2[ , c("date", "gridcell", "r")]))
However, if you prefer to be explicit (which is sometimes useful), this gives the same result:
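A sketch of the explicit form, assuming the same df1 and df2 as in the question (rebuilt here for self-containment; implicit, explicit and keys are illustrative names). It also compares the result against the version that relies on the default by:

```r
# Sample data from the question, rebuilt compactly
df1 <- data.frame(
  index    = 1:7,
  date     = c("13/04/2011", "13/04/2011", "04/04/2011", "04/04/2011",
               "04/04/2011", "28/03/2011", "28/03/2011"),
  yrday    = c(103L, 103L, 94L, 94L, 94L, 87L, 87L),
  gridcell = c(6L, 9L, 2L, 5L, 8L, 3L, 4L),
  dist     = c(178L, 158L, 137L, 116L, 95L, 135L, 115L),
  ang      = c(148, 147.6, 163.6, 159.7, 152.5, 152.2, 121.9),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  date     = rep(c("28/03/2011", "29/03/2011", "04/04/2011", "13/04/2011"),
                 each = 9),
  yrday    = rep(c(87L, 88L, 94L, 103L), each = 9),
  gridcell = rep(1:9, times = 4),
  r        = as.integer(c(161, 162, 162, 164, 167, 168, 169, 170, 170,
                          171, 170, 169, 168, 158, 160, 162, 164, 165,
                          263, 258, 255, 250, 246, 242, 239, 238, 228,
                          235, 234, 231, 230, 229, 228, 227, 243, 242)),
  stringsAsFactors = FALSE
)

df2_sub <- df2[, c("date", "gridcell", "r")]

# The columns merge() would pick by default: those shared by both inputs
keys <- intersect(names(df1), names(df2_sub))
print(keys)

# Default 'by' versus explicitly named key columns
implicit <- merge(df1, df2_sub)
explicit <- merge(df1, df2_sub, by = c("date", "gridcell"))
print(identical(implicit, explicit))
```

Naming the keys explicitly protects against a silent change in behaviour if another shared column is later added to either data frame.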