
Join multiple columns of a data.frame to a single column in a reference data.frame


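I want to join two data frames. Unlike a normal join, however, I want to match a series of columns in the first data frame against a single column in the second. Basically, I have a list of sites, each with its nearest surrounding sites listed. I need to look up the full Gauge and LTA IDs of those nearest sites in a separate data frame. I have provided some example data frames, including an example of the desired output. The real data is not this neat (and has many more columns and rows), which is why I need to look up the Surrogates gauges in TestRefList rather than build the output the way I do below.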

library(plyr)
library(tidyverse)

TestRefList <- data.frame(Site = paste0("sl",1:10,".1"), Gauge = paste0(1:10,".1","/110.00/1"), LTA = paste0(1:10,".1","/110.99/1"), stringsAsFactors = F)
Surrogates <- data.frame(Primary = paste0("sl",c(2,4,6),".1"), nearest1=paste0("sl",1:3,".1"), nearest2=paste0("sl",7:9,".1"), stringsAsFactors = F)
HopefulOutput <- data.frame(Primary = paste0("sl",c(2,4,6),".1"), nearest1=paste0("sl",1:3,".1"), nearest2=paste0("sl",7:9,".1"), 
                    nearest1Gauge = paste0(1:3,".1","/110.00/1"), nearest1LTA = paste0(1:3,".1","/110.99/1"), 
                    nearest2Gauge = paste0(7:9,".1","/110.00/1"), nearest2LTA = paste0(7:9,".1","/110.99/1"), stringsAsFactors = F)

I thought I could use some combination of plyr::ldply and dplyr::left_join, for example:

Out <- ldply(names(Surrogates)[2:3], function(x) left_join(Surrogates, TestRefList, by = c(paste0(x, '="Site"'))))

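But I cannot get the join to work when the column names come from a list. I have tried a few arrangements of ' and " outside of the list, for example: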

left_join(Surrogates,TestRefList, by = c(paste0('"',names(Surrogates)[2],'"' , '="Site"')))

Even if I could get this part to work, I am not sure how it would behave inside ldply.

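Any ideas? I am happy to take a completely different approach if necessary, although I am more comfortable with data.frames and the tidyverse than with data.table options.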

3 Answers

  • 0

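    I am offering a data.table-based solution. Your task can surely be done with dplyr, as you requested, but I am not aware of a dplyr solution to this problem. I also think the data.table solution below is quite elegant and fast, provided you are willing to add another package to your workflow. In addition, this code already works for any number of "nearest n" columns in the data.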

    library(data.table)
    
    # Melt the Surrogate data, providing useful column names.
    surrogate_dat = melt(data.table(Surrogates), 
                    id.vars="Primary", 
                    value.name="Site", 
                    variable.name="nearest_site_group")
    #    Primary nearest_site_group  Site
    # 1:   sl2.1           nearest1 sl1.1
    # 2:   sl4.1           nearest1 sl2.1
    # 3:   sl6.1           nearest1 sl3.1
    # 4:   sl2.1           nearest2 sl7.1
    # 5:   sl4.1           nearest2 sl8.1
    # 6:   sl6.1           nearest2 sl9.1
    
    # Merge melted surrogate data with reference list data.
    merged_dat = merge(x=surrogate_dat, 
                       y=data.table(TestRefList), 
                       by="Site")
    #     Site Primary nearest_site_group        Gauge          LTA
    # 1: sl1.1   sl2.1           nearest1 1.1/110.00/1 1.1/110.99/1
    # 2: sl2.1   sl4.1           nearest1 2.1/110.00/1 2.1/110.99/1
    # 3: sl3.1   sl6.1           nearest1 3.1/110.00/1 3.1/110.99/1
    # 4: sl7.1   sl2.1           nearest2 7.1/110.00/1 7.1/110.99/1
    # 5: sl8.1   sl4.1           nearest2 8.1/110.00/1 8.1/110.99/1
    # 6: sl9.1   sl6.1           nearest2 9.1/110.00/1 9.1/110.99/1
    
    # 'Cast' merged data back to wide form, specifying 3 value variables.
    results= dcast(data=merged_dat, 
                   formula=Primary ~ nearest_site_group, 
                   value.var=c("Site", "Gauge", "LTA"))
    #    Primary Site_nearest1 Site_nearest2 Gauge_nearest1 Gauge_nearest2
    # 1:   sl2.1         sl1.1         sl7.1   1.1/110.00/1   7.1/110.00/1
    # 2:   sl4.1         sl2.1         sl8.1   2.1/110.00/1   8.1/110.00/1
    # 3:   sl6.1         sl3.1         sl9.1   3.1/110.00/1   9.1/110.00/1
    #    LTA_nearest1 LTA_nearest2
    # 1: 1.1/110.99/1 7.1/110.99/1
    # 2: 2.1/110.99/1 8.1/110.99/1
    # 3: 3.1/110.99/1 9.1/110.99/1
    
  • 1
    Reduce(function(x, fld) merge(x, TestRefList, by.x=fld, by.y="Site"),
           c("nearest1", "nearest2"), init = Surrogates)
    #   nearest2 nearest1 Primary      Gauge.x        LTA.x      Gauge.y        LTA.y
    # 1    sl7.1    sl1.1   sl2.1 1.1/110.00/1 1.1/110.99/1 7.1/110.00/1 7.1/110.99/1
    # 2    sl8.1    sl2.1   sl4.1 2.1/110.00/1 2.1/110.99/1 8.1/110.00/1 8.1/110.99/1
    # 3    sl9.1    sl3.1   sl6.1 3.1/110.00/1 3.1/110.99/1 9.1/110.00/1 9.1/110.99/1
    

    You can rename the columns as needed. The same thing can be done with dplyr::left_join, with very little change:

    Reduce(function(x, fld) left_join(x, TestRefList, by = setNames("Site", fld)),
           c("nearest1", "nearest2"), init = Surrogates)
    

    Or within a pipe:

    Surrogates %>% 
      Reduce(function(x, fld) left_join(x, TestRefList, by = setNames("Site", fld)),
             c("nearest1", "nearest2"), init = .)
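
    The reason `by = setNames("Site", fld)` works where the pasted string in the question does not is that `left_join()` expects a named character vector (name = column in `x`, value = column in `y`), not a string that merely looks like `"nearest1"="Site"`. The sketch below illustrates this and also folds the renaming into the join. The helper name `add_lookup` is hypothetical, and the renaming step assumes `TestRefList` has exactly the columns `Site`, `Gauge`, `LTA` in that order.

    ```r
    library(dplyr)

    # Example data copied from the question.
    TestRefList <- data.frame(Site  = paste0("sl", 1:10, ".1"),
                              Gauge = paste0(1:10, ".1", "/110.00/1"),
                              LTA   = paste0(1:10, ".1", "/110.99/1"),
                              stringsAsFactors = FALSE)
    Surrogates <- data.frame(Primary  = paste0("sl", c(2, 4, 6), ".1"),
                             nearest1 = paste0("sl", 1:3, ".1"),
                             nearest2 = paste0("sl", 7:9, ".1"),
                             stringsAsFactors = FALSE)

    # A named vector built programmatically; same as c(nearest1 = "Site").
    by_named <- setNames("Site", "nearest1")
    joined   <- left_join(Surrogates, TestRefList, by = by_named)

    # Hypothetical helper: rename the reference columns first, then join on
    # the shared name, so no .x / .y suffixes appear.
    add_lookup <- function(x, fld) {
      ref <- setNames(TestRefList, c(fld, paste0(fld, c("Gauge", "LTA"))))
      left_join(x, ref, by = fld)
    }

    out <- Reduce(add_lookup, c("nearest1", "nearest2"), Surrogates)
    ```

    With real data that has more reference columns, the vector passed to `setNames()` inside the helper would have to be built from `names(TestRefList)` instead of being hard-coded.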
    
  • 0

    Here is a general solution for any number of "nearest" columns in Surrogates. It first gets a vector of the "nearest" column names and then works from there.

    # get list of columns matching "nearest"
    nearestCols <- colnames(Surrogates) %>%
      `[`(grepl("nearest", .))
    
    # output data.frame
    out <- Surrogates
    
    # for each "nearest" column, merge Gauge and LTA
    for (n in nearestCols) {
      out <- merge(out, TestRefList, by.x = n, by.y = "Site", all.x = TRUE)
      colnames(out)[(ncol(out)-1):ncol(out)] <- paste0(n, c("Gauge", "LTA"))
    }
    
    # re-order the columns
    out <- out[, c(length(nearestCols) + 1, length(nearestCols):1, (length(nearestCols)+2):ncol(out))]
    

    Output:

    > out
      Primary nearest1 nearest2 nearest1Gauge  nearest1LTA nearest2Gauge  nearest2LTA
    1   sl2.1    sl1.1    sl7.1  1.1/110.00/1 1.1/110.99/1  7.1/110.00/1 7.1/110.99/1
    2   sl4.1    sl2.1    sl8.1  2.1/110.00/1 2.1/110.99/1  8.1/110.00/1 8.1/110.99/1
    3   sl6.1    sl3.1    sl9.1  3.1/110.00/1 3.1/110.99/1  9.1/110.00/1 9.1/110.99/1
    > identical(out, HopefulOutput)
    [1] TRUE
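
    If the re-ordering step feels fragile, a lookup with base R's `match()` is one possible alternative to the repeated `merge()`. This is a different technique from the one above, shown only as a sketch: it keeps the rows and columns of `Surrogates` in their original order, so nothing needs re-ordering afterwards. It assumes each `Site` appears only once in `TestRefList`, because `match()` returns the first hit only.

    ```r
    # Example data copied from the question.
    TestRefList <- data.frame(Site  = paste0("sl", 1:10, ".1"),
                              Gauge = paste0(1:10, ".1", "/110.00/1"),
                              LTA   = paste0(1:10, ".1", "/110.99/1"),
                              stringsAsFactors = FALSE)
    Surrogates <- data.frame(Primary  = paste0("sl", c(2, 4, 6), ".1"),
                             nearest1 = paste0("sl", 1:3, ".1"),
                             nearest2 = paste0("sl", 7:9, ".1"),
                             stringsAsFactors = FALSE)

    # Names of all columns containing "nearest".
    nearestCols <- grep("nearest", names(Surrogates), value = TRUE)

    out <- Surrogates
    for (n in nearestCols) {
      # Row positions in TestRefList for each site; NA where there is no match.
      idx <- match(Surrogates[[n]], TestRefList$Site)
      out[[paste0(n, "Gauge")]] <- TestRefList$Gauge[idx]
      out[[paste0(n, "LTA")]]   <- TestRefList$LTA[idx]
    }
    ```

    Sites missing from `TestRefList` end up as `NA`, which mirrors the `all.x = TRUE` behaviour of the `merge()` version.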
    
