
R: Merging two data frames on multiple conditions using dates and an ID

Viewed 1741 times
2

I am trying to merge two data frames on multiple conditions using the merge command, but I cannot get the output I need.

#Data Frame df1#
ID<- c("A1", "A2","A3", "A4")
Location <- c("012A","234B","012A","238C" )
startdate <- as.Date(c("2014-11-01","2014-01-01","2015-10-01", "2015-01-01"))
enddate <- as.Date(c("2014-12-31","2014-08-31","2015-12-31","2015-12-31"))
df1<- data.frame(ID,Location, startdate, enddate)

#Data Frame df2#
ID<-c("A1", "A1", "A4")
N<- c(2,1,2)
Loss_Date <- as.Date(c("2014-11-15", "2015-12-25", "2015-11-30"))
Amt<-c("2200","1000", "500")
df2<- data.frame(ID, N, Loss_Date,Amt)

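I want to merge these two data frames using ID as the common column (df2 has no Location column), with the extra condition that Loss_Date in df2 lies between startdate and enddate in df1 (inclusive). As you can see, the second entry in df2 is not mapped because its date falls outside the range in df1.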

#Required Output
ID<- c("A1", "A2","A3", "A4")
Location <- c("012A","234B","012A","238C" )
startdate <- as.Date(c("2014-11-01","2014-01-01","2015-10-01", "2015-01-01"))
enddate <- as.Date(c("2014-12-31","2014-08-31","2015-12-31","2015-12-31"))
N<-c(2,0,0,2)
Loss_Date <- c("2014-11-15", "NA", "NA", "2015-11-30")
Amt<-c("2200","0","0", "500")
Output<- data.frame(ID,Location, startdate, enddate,N, Loss_Date,Amt)

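I created a common key from the year and the ID, but that produced incorrect mappings. I have tried various approaches with merge and match, but none of them worked. I need this to run on more than 170K observations, and the two data frames do not have the same number of rows. Any help would be much appreciated.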

4 Answers

  • 2

    I did the merge with the dplyr package, which is fast and easy to use.

    You should add stringsAsFactors = F to your data frame definitions:

    df1<- data.frame(ID,Location, startdate, enddate, stringsAsFactors = F)
    df2<- data.frame(ID, N, Loss_Date,Amt, stringsAsFactors = F)
    

    That way, your character inputs are not converted to factors, which avoids unwanted side effects later on.
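
    A quick way to see the difference is to compare the column classes with and without the argument. This is a minimal sketch on a made-up vector; `ids`, `with_factors` and `without_factors` are illustrative names only. Note that since R 4.0.0 the default is already `stringsAsFactors = FALSE`, so the explicit argument mainly matters on older versions.

    ```r
    # Illustrative input, not part of the question's data
    ids <- c("A1", "A2", "A3")

    # Same data frame built both ways
    with_factors    <- data.frame(ID = ids, stringsAsFactors = TRUE)
    without_factors <- data.frame(ID = ids, stringsAsFactors = FALSE)

    class(with_factors$ID)     # factor
    class(without_factors$ID)  # character
    ```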

    install.packages("dplyr")
    library(dplyr)
    
    output <- full_join(df1, df2, by="ID") %>% 
    filter(Loss_Date >= startdate & Loss_Date <= enddate)
    

    Output:

      ID Location  startdate    enddate N  Loss_Date  Amt
    1 A1     012A 2014-11-01 2014-12-31 2 2014-11-15 2200
    2 A4     238C 2015-01-01 2015-12-31 2 2015-11-30  500
    

    Also, as requested in the comments, if you want to keep the rows that do not match the condition, you need a different approach:

    output2 <- left_join(df1, df2, by="ID") %>% 
     mutate(condition = (Loss_Date >= startdate & Loss_Date <= enddate)) %>%
     mutate(N = ifelse(condition & !is.na(condition), N, 0)) %>%
     mutate(Loss_Date = as.Date(ifelse(condition, Loss_Date, NA),origin="1970-01-01")) %>%
     mutate(Amt = ifelse(condition & !is.na(condition), Amt, 0)) %>%
     mutate(condition = ifelse(is.na(condition),T,condition)) %>%
     filter(condition) %>%
     select(-condition)
    

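    First, create a new column holding the match condition, then set the other columns to 0 or NA depending on that condition. Finally, drop the newly created column. (Note that ifelse changes the class of a Date to numeric, which is why as.Date is needed.)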

      ID Location  startdate    enddate N  Loss_Date  Amt
    1 A1     012A 2014-11-01 2014-12-31 2 2014-11-15 2200
    2 A2     234B 2014-01-01 2014-08-31 0       <NA>    0
    3 A3     012A 2015-10-01 2015-12-31 0       <NA>    0
    4 A4     238C 2015-01-01 2015-12-31 2 2015-11-30  500
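
    The remark about ifelse and Date can be checked in isolation. The following is a small sketch on two made-up dates (`d`, `keep`, `raw` and `fixed` are illustrative names): ifelse returns the underlying day counts, and as.Date with the 1970-01-01 origin turns them back into dates.

    ```r
    d    <- as.Date(c("2014-11-15", "2015-12-25"))
    keep <- c(TRUE, FALSE)

    # ifelse() drops the Date class and returns plain day counts
    raw <- ifelse(keep, d, NA)
    class(raw)    # numeric

    # Convert back, counting days from the Unix epoch
    fixed <- as.Date(raw, origin = "1970-01-01")
    class(fixed)  # Date
    ```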
    
  • 2

    sqldf is very powerful and easy to read. Take a look at this code:

    library(sqldf)
    Output<-sqldf("
               SELECT L.*, r.N, r.Loss_Date, r.Amt
               FROM df1 as L
               LEFT JOIN df2 as r
               ON 
               L.ID=r.ID AND
                  r.Loss_Date BETWEEN L.startdate AND L.enddate
               ORDER BY L.ID")
    

    Here "L" stands for df1 (df1 is aliased as L) and "r" stands for df2 (df2 is aliased as r).
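
    For readers who prefer to stay in base R, the same LEFT JOIN logic can be approximated with two merge() calls. This is only a sketch, not part of the sqldf answer: it assumes each combination of ID, startdate and enddate is unique in df1, and `pairs`, `matched` and `base_out` are illustrative names. The column order differs from the SQL result, and unmatched rows get NA rather than 0.

    ```r
    # Sample data from the question
    df1 <- data.frame(
      ID = c("A1", "A2", "A3", "A4"),
      Location = c("012A", "234B", "012A", "238C"),
      startdate = as.Date(c("2014-11-01", "2014-01-01", "2015-10-01", "2015-01-01")),
      enddate = as.Date(c("2014-12-31", "2014-08-31", "2015-12-31", "2015-12-31")),
      stringsAsFactors = FALSE
    )
    df2 <- data.frame(
      ID = c("A1", "A1", "A4"),
      N = c(2, 1, 2),
      Loss_Date = as.Date(c("2014-11-15", "2015-12-25", "2015-11-30")),
      Amt = c("2200", "1000", "500"),
      stringsAsFactors = FALSE
    )

    # Step 1: inner join on ID, then keep only the pairs whose Loss_Date
    # falls inside [startdate, enddate] (the ON clause of the SQL query)
    pairs <- merge(df1, df2, by = "ID")
    pairs <- pairs[pairs$Loss_Date >= pairs$startdate &
                   pairs$Loss_Date <= pairs$enddate, ]

    # Step 2: left join df1 to the matched pairs so that unmatched
    # df1 rows are kept, with NA in the df2 columns
    matched  <- pairs[, c("ID", "startdate", "enddate", "N", "Loss_Date", "Amt")]
    base_out <- merge(df1, matched, by = c("ID", "startdate", "enddate"), all.x = TRUE)
    ```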

  • 1

    I just added some extra code to @VincentBoned's answer.

    # create 1st dataframe
    ID<- c("A1", "A2","A3", "A4")
    Location <- c("012A","234B","012A","238C" )
    startdate <- as.Date(c("2014-11-01","2014-01-01","2015-10-01", "2015-01-01"))
    enddate <- as.Date(c("2014-12-31","2014-08-31","2015-12-31","2015-12-31"))
    
    df1<- data.frame(ID,Location, startdate, enddate, stringsAsFactors = F)
    
    
    # create 2nd dataframe
    ID<-c("A1", "A1", "A4")
    N<- c(2,1,2)
    Loss_Date <- as.Date(c("2014-11-15", "2015-12-25", "2015-11-30"))
    Amt<-c("2200","1000", "500")
    
    df2<- data.frame(ID, N, Loss_Date,Amt, stringsAsFactors = F)
    
    
    library(dplyr)
    
    full_join(df1, df2, by="ID") %>% 
      mutate(condition = (Loss_Date >= startdate & Loss_Date <= enddate)) %>%
      mutate(N = ifelse(condition & !is.na(condition), N, 0)) %>%
      mutate(Loss_Date = as.Date(ifelse(condition, Loss_Date, NA),origin="1970-01-01")) %>%
      mutate(Amt = ifelse(condition & !is.na(condition), Amt, 0)) %>%
      select(-condition) %>%
      group_by(ID) %>%                              # for each ID
      mutate(Nrows = n()) %>%                       # count how many rows they have in the final table
      ungroup() %>%
      filter(!(Nrows > 1 & is.na(Loss_Date))) %>%   # filter out rows with IDs that have more than 1 rows and those rows are not matched
      select(-Nrows)
    
    #   ID Location  startdate    enddate N  Loss_Date  Amt 
    # 1 A1     012A 2014-11-01 2014-12-31 2 2014-11-15 2200 
    # 2 A2     234B 2014-01-01 2014-08-31 0       <NA>    0 
    # 3 A3     012A 2015-10-01 2015-12-31 0       <NA>    0 
    # 4 A4     238C 2015-01-01 2015-12-31 2 2015-11-30  500
    

    Once you understand how the code above works (step by step), you can use a more compact version that returns the same result:

    full_join(df1, df2, by="ID") %>% 
      mutate(condition = (Loss_Date >= startdate & Loss_Date <= enddate),
             N = ifelse(condition & !is.na(condition), N, 0),
             Loss_Date = as.Date(ifelse(condition, Loss_Date, NA),origin="1970-01-01"),
             Amt = ifelse(condition & !is.na(condition), Amt, 0)) %>%
      group_by(ID) %>%                             
      mutate(Nrows = n()) %>%                      
      filter(!(Nrows > 1 & is.na(Loss_Date))) %>%
      ungroup() %>%                                # drop the grouping, as in the longer version
      select(-c(condition, Nrows))
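
    As a possible alternative, newer dplyr releases let you state the date range directly in the join. The sketch below assumes dplyr 1.1.0 or later, where join_by() is available; `out` is an illustrative name. Because this is a left join with the range as part of the join condition, unmatched df1 rows are kept with NA, which coalesce() then replaces with 0.

    ```r
    library(dplyr)  # assumption: dplyr >= 1.1.0, which provides join_by()

    df1 <- data.frame(
      ID = c("A1", "A2", "A3", "A4"),
      Location = c("012A", "234B", "012A", "238C"),
      startdate = as.Date(c("2014-11-01", "2014-01-01", "2015-10-01", "2015-01-01")),
      enddate = as.Date(c("2014-12-31", "2014-08-31", "2015-12-31", "2015-12-31")),
      stringsAsFactors = FALSE
    )
    df2 <- data.frame(
      ID = c("A1", "A1", "A4"),
      N = c(2, 1, 2),
      Loss_Date = as.Date(c("2014-11-15", "2015-12-25", "2015-11-30")),
      Amt = c("2200", "1000", "500"),
      stringsAsFactors = FALSE
    )

    # In join_by(), the left-hand side of each condition refers to df1
    # and the right-hand side to df2
    out <- left_join(df1, df2,
                     by = join_by(ID, startdate <= Loss_Date, enddate >= Loss_Date)) %>%
      mutate(N = coalesce(N, 0),        # unmatched rows: N becomes 0
             Amt = coalesce(Amt, "0"))  # Amt is character, so use "0"
    ```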
    
  • 1

    In the current development version of data.table (v1.9.7), non-equi joins are implemented. With that, we can do:

    require(data.table) # v1.9.7+
    setDT(df2)[df1, .(ID, Location, startdate, enddate, N, x.Loss_Date, Amt), 
                          on=.(ID, Loss_Date>=startdate, Loss_Date<=enddate)]
    #    ID Location  startdate    enddate  N x.Loss_Date  Amt
    # 1: A1     012A 2014-11-01 2014-12-31  2  2014-11-15 2200
    # 2: A2     234B 2014-01-01 2014-08-31 NA        <NA>   NA
    # 3: A3     012A 2015-10-01 2015-12-31 NA        <NA>   NA
    # 4: A4     238C 2015-01-01 2015-12-31  2  2015-11-30  500
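
    The result above has NA where the question asks for 0. One way to close that gap is to update the unmatched rows by reference after the join. This is a sketch assuming a data.table version that supports non-equi joins; `dt1`, `dt2` and `res` are illustrative names, and fresh tables are built here so the example runs on its own.

    ```r
    library(data.table)  # assumption: a version with non-equi joins (v1.9.7+)

    dt1 <- data.table(
      ID = c("A1", "A2", "A3", "A4"),
      Location = c("012A", "234B", "012A", "238C"),
      startdate = as.Date(c("2014-11-01", "2014-01-01", "2015-10-01", "2015-01-01")),
      enddate = as.Date(c("2014-12-31", "2014-08-31", "2015-12-31", "2015-12-31"))
    )
    dt2 <- data.table(
      ID = c("A1", "A1", "A4"),
      N = c(2, 1, 2),
      Loss_Date = as.Date(c("2014-11-15", "2015-12-25", "2015-11-30")),
      Amt = c("2200", "1000", "500")
    )

    # Same non-equi join as above, renaming x.Loss_Date back to Loss_Date
    res <- dt2[dt1, .(ID, Location, startdate, enddate, N, Loss_Date = x.Loss_Date, Amt),
               on = .(ID, Loss_Date >= startdate, Loss_Date <= enddate)]

    # Fill the unmatched rows in place; Amt is character, so use "0"
    res[is.na(N), `:=`(N = 0, Amt = "0")]
    ```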
    
