首页 文章

如何通过用data.table或dplyr替换复杂和缓慢的plyr步骤来提高R代码的速度?

提问于
浏览
0

我一直在通过搜索其他人如何在Stackoverflow上做事来学习R,因此,我已经熟悉了plyr语法 . 我有以下四个使用ddply的plyr调用,这是我的代码的速率限制步骤 . 我的数据接近数十万到数百万条记录,而且我的代码主要由于data.table而过时,并且只受到四个速率限制但是关键的plyr步骤的限制 . 我想用dplyr或data.table替换它们,但我一直在努力复制语法,并希望得到任何帮助 .

1. mergeddf3 <- ddply(mergeddf2, .(df.activ.id, channel), summarize, spotsids = paste(mainID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(cat, collapse = ","), effrespflags = paste(effrespflag, collapse = ","))

2. webuniq_test <- ddply(webuniq, c("df.activ.id"),summarise, strRM = paste(replicate(RMCount, "RM"), collapse = ","))

3. webactiv2 <- ddply(webactiv, .(VisitorID), summarize, VisitorPath = paste(Path, collapse = ","), RMpath = paste(strRM, collapse = ","), ConvTot=sum(Conv), Conv2Tot=sum(Conv2), Cov3Tot=sum(Conv3)) #check that nrow dec

4. MeltForSO3 <- ddply(MeltForSO2, c("VisitorID","ID"),summarise, SplitThis = paste(value, collapse = ","))

对于(1)这里是基准:

#user  system elapsed 
#378.463   3.136 383.786

这是我在这些步骤中想要完成的事情(它们是相似的):

  • 它们涉及ID字段或ID字段的数据聚合

  • 使用粘贴和折叠进行粒度字符字段的聚合 . 例如,对于每个站点,字段可能是一个驱动程序's stops as he drops off packages where a 2502201 field would have values ' a ', ' b ', ' c' . 使用 stops_path = paste(stops, collapse = ",") 的Plyr会将这些步骤汇总到一行"a,b,c"

  • 数值数据有时会在同一个聚合步骤中求和,例如 ConvTot=sum(Conv)

我试图用dplyr或data.table复制它是不成功的 .

对于这些类型的聚合,使用一个优于另一个是否有优势?我已经看了这个,看起来data.table可能对我的非常简单的用例更好,因为语法更清晰:data.table vs dplyr: can one do something well the other can't or does poorly?

这是我尝试使用data.table复制上述(1)的失败:

setkey(setDT(mergeddf2),df.activ.id, MarketingChannel)
mergeddf3test <- mergeddf2[, list(spotsids = paste(mainID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(tvcat, collapse = ","), effrespflags = paste(effrespflag, collapse = ",")), by=list(df.activ.id,Channel)]

这引发了一个错误: unused argument (by = list(df.activ.id, Channel)) 我从代码中写道,我在SO上研究了如何将粘贴到data.table中 . 我拿出了by参数只是为了看看会发生什么,并在下面的行中得到另一个错误:

mergeddf3test <- mergeddf2[, list(spotsids = paste(spotID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(tvcat, collapse = ","), effrespflags = paste(effrespflag, collapse = ","))]

错误是 "Error in paste(spotID, collapse = ",") : object 'spotID' not found" 这很奇怪,因为该字段肯定在数据中 . 我认为这个data.table行将使用by字段(df.activ.id和Channel)正确地聚合数据,并将字符字段与上面的(a,b,c)示例结合起来 .

显然,考虑到我正在使用的数据规模,我需要正确学习dplyr或data.table的语法,所以我已经注册了两个包的数据阵营类 . 尽管如此,我还是对如何在短期内解决这个问题表示感谢 .

谢谢!

1 回答

  • 1

    您使用 data.table 复制适用于我(除了该通道大写) . 以下是我尝试使用 dplyrdata.table 复制列表的第一步 .

    # required packages
    require(plyr)
    require(dplyr)
    require(data.table)
    

    样本数据

    mergeddf2 <- data.frame(df.activ.id = 1:5, 
                            channel = 1:8, 
                            mainID = 1:40, 
                            DateTime = Sys.Date() - 80:1, 
                            cat = letters[1:6], 
                            effrespflag = rnorm(240), 
                            othervar = 1, 
                            MarketingChannel = 2)
    

    plyr解决方案

    mergeddf3 <- ddply(mergeddf2, .(df.activ.id, channel), summarize, 
                       spotsids = paste(mainID, collapse = ","), 
                       spotsdt = paste(DateTime, collapse = ","), 
                       spotsinfos = paste(cat, collapse = ","), 
                       effrespflags = paste(effrespflag, collapse = ","))
    

    dplyr解决方案

    mergeddf3.dplyr <- 
      mergeddf2 %>% 
      group_by(df.activ.id, channel) %>%
      summarise_each(funs = funs(paste(., collapse = ",")), mainID, DateTime, cat, effrespflag) %>%
      magrittr::set_colnames(c("df.activ.id", "channel", "spotsids", "spotsdt", "spotsinfos", "effrespflags")) 
    # check for equality
    all.equal(mergeddf3, as.data.frame(mergeddf3.dplyr))
    ## [1] TRUE
    

    data.table解决方案

    setDT(mergeddf2)
    mergeddf3test <- mergeddf2[, list(spotsids = paste(mainID, collapse = ","), 
                                      spotsdt = paste(DateTime, collapse = ","), 
                                      spotsinfos = paste(cat, collapse = ","), 
                                      effrespflags = paste(effrespflag, collapse = ",")),
                               by=list(df.activ.id,channel)] 
    # check for equality
    all.equal(mergeddf3, setDF(setkeyv(mergeddf3test, c("df.activ.id", "channel"))))
    ## [1] TRUE
    

相关问题