
Fast vectorized merge of a list of data.frames by row


Most of the questions on SO about merging data.frames in a list don't quite relate to what I'm trying to get at here, but feel free to prove me wrong.

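I have a list of data.frames. I would like to "rbind" them row by row into new data.frames. In essence, all first rows form one data.frame, all second rows form a second data.frame, and so on. The result would be a list whose length equals the number of rows in the original data.frames. So far, the data.frames all have identical dimensions.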
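
To make the target shape concrete, here is a small sketch of an input and the output it should produce. The `toy` and `expected` objects and their values are made up for illustration and are not part of the original question.

```r
# Hypothetical toy input: two data.frames with identical dimensions.
toy <- list(
  data.frame(x = 1:3, y = c(10, 20, 30)),
  data.frame(x = 4:6, y = c(40, 50, 60))
)

# Expected result, written out by hand:
# element i holds row i of every input data.frame, in list order.
expected <- list(
  data.frame(x = c(1L, 4L), y = c(10, 40)),
  data.frame(x = c(2L, 5L), y = c(20, 50)),
  data.frame(x = c(3L, 6L), y = c(30, 60))
)

# One output element per input row, one output row per input data.frame.
length(expected) == nrow(toy[[1]])
nrow(expected[[1]]) == length(toy)
```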

Here are some data to play with.

sample.list <- list(data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
        data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
        data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
        data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
        data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
        data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
        data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)))

This is what I came up with, using a good old for loop.

#solution 1
my.list <- vector("list", nrow(sample.list[[1]]))
for (i in 1:nrow(sample.list[[1]])) {
    for (j in 1:length(sample.list)) {
        my.list[[i]] <- rbind(my.list[[i]], sample.list[[j]][i, ])
    }
}

#solution 2 (so far my favorite)
sample.list2 <- do.call("rbind", sample.list)
my.list2 <- vector("list", nrow(sample.list[[1]]))

for (i in 1:nrow(sample.list[[1]])) {
    my.list2[[i]] <- sample.list2[seq(from = i, to = nrow(sample.list2), by = nrow(sample.list[[1]])), ]
}

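Can this be improved with vectorization without too much mental gymnastics? A correct answer will, of course, contain a snippet of code. A plain "yes" does not count as an answer.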

EDIT

#solution 3 (a variant of solution 2 above)
ind <- rep(1:nrow(sample.list[[1]]), times = length(sample.list))
my.list3 <- split(x = sample.list2, f = ind)
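
The grouping index in solution 3 can be checked on a small case. The sketch below uses made-up data (`frames`, `stacked`, and the other names are illustrative assumptions) and compares the `split` result with the strided indexing from solution 2.

```r
# Hypothetical small input: two data.frames with three rows each.
frames <- list(
  data.frame(x = 1:3, y = 4:6),
  data.frame(x = 7:9, y = 10:12)
)
stacked <- do.call(rbind, frames)
n <- nrow(frames[[1]])

# Each stacked row is labelled with its position inside its source frame,
# so the labels cycle 1..n once per frame.
ind <- rep(seq_len(n), times = length(frames))

# Solution 3: split the stacked rows by that label.
by_split <- split(stacked, ind)

# Solution 2: take every n-th row, starting at row i.
by_stride <- lapply(seq_len(n), function(i) {
  stacked[seq(from = i, to = nrow(stacked), by = n), ]
})

# split() names its elements by group label; drop the names to compare.
same <- isTRUE(all.equal(unname(by_split), by_stride))
same
```

Note that `split` returns a list named by group label ("1", "2", ...), whereas the loop-based versions return an unnamed list.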

Benchmark

My real list is bigger, with more rows per data.frame. I benchmarked the solutions, and the results are as follows:

#solution 1
system.time(for (i in 1:nrow(sample.list[[1]])) {
    for (j in 1:length(sample.list)) {
        my.list[[i]] <- rbind(my.list[[i]], sample.list[[j]][i, ])
    }
})
   user  system elapsed 
 80.989   0.004  81.210 

# solution 2
system.time(for (i in 1:nrow(sample.list[[1]])) {
    my.list2[[i]] <- sample.list2[seq(from = i, to = nrow(sample.list2), by = nrow(sample.list[[1]])), ]
})
   user  system elapsed 
  0.957   0.160   1.126 

# solution 3
system.time(split(x = sample.list2, f = ind))
   user  system elapsed 
  1.104   0.204   1.332 

# solution Gabor
system.time(lapply(1:nr, bind.ith.rows))
   user  system elapsed 
  0.484   0.000   0.485 

# solution ncray
system.time(alply(do.call("cbind",sample.list), 1,
                .fun=matrix, ncol=ncol(sample.list[[1]]), byrow=TRUE,
                dimnames=list(1:length(sample.list),names(sample.list[[1]]))))
   user  system elapsed 
 11.296   0.016  11.365

3 Answers

  • 5

    Try this:

    bind.ith.rows <- function(i) do.call(rbind, lapply(sample.list, "[", i, TRUE))
    nr <- nrow(sample.list[[1]])
    lapply(1:nr, bind.ith.rows)
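
The call `lapply(sample.list, "[", i, TRUE)` passes `i` and `TRUE` to the subsetting function, so each element is evaluated as `df[i, TRUE]`, that is, row `i` with all columns. The following sketch, on made-up data with illustrative names (`frames`, `bind_ith_rows`, `drop_rownames`), checks that this gives the same rows as the stack-and-stride approach from the question. Row names differ between the two approaches, so they are reset before comparing.

```r
set.seed(42)

# Hypothetical input: four data.frames with five rows each.
frames <- replicate(
  4,
  data.frame(x = sample(1:100, 5), y = sample(1:100, 5)),
  simplify = FALSE
)
nr <- nrow(frames[[1]])

# Extract row i from every frame, then bind those rows together.
bind_ith_rows <- function(i) do.call(rbind, lapply(frames, "[", i, TRUE))
by_lapply <- lapply(seq_len(nr), bind_ith_rows)

# Reference: stack all frames, then take every nr-th row starting at i.
stacked <- do.call(rbind, frames)
by_stride <- lapply(seq_len(nr), function(i) {
  stacked[seq(from = i, to = nrow(stacked), by = nr), ]
})

# Reset row names so that only the cell values are compared.
drop_rownames <- function(df) {
  rownames(df) <- NULL
  df
}
same <- isTRUE(all.equal(
  lapply(by_lapply, drop_rownames),
  lapply(by_stride, drop_rownames)
))
same
```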
    
  • 39

    Some of the solutions can be made faster by using data.table

    EDIT - a larger data set shows off data.table's awesomeness even more.

    # here are some sample data 
    sample.list <- replicate(10000, data.frame(x = sample(1:100, 10), 
      y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)), simplify = FALSE)
    

    Gabor's fast solution:

    # Solution Gabor
    bind.ith.rows <- function(i) do.call(rbind, lapply(sample.list, "[", i, TRUE))
    nr <- nrow(sample.list[[1]])
    system.time(rowbound <- lapply(1:nr, bind.ith.rows))
    
    ##    user  system elapsed 
    ##   25.87    0.01   25.92
    

    Even when working with data.frames, the data.table function rbindlist is faster.

    library(data.table)
    fastbind.ith.rows <- function(i) rbindlist(lapply(sample.list, "[", i, TRUE))
    system.time(fastbound <- lapply(1:nr, fastbind.ith.rows))
    
    ##    user  system elapsed 
    ##   13.89    0.00   13.89
    

    A data.table solution

    Here is a solution that uses data.tables - it is the split solution on steroids.

    # data.table solution
    system.time({
        # change each element of sample.list to a data.table (and data.frame);
        # this is done instantaneously, by reference
        invisible(lapply(sample.list, setattr, name = "class", 
                   value = c("data.table",  "data.frame")))
        # combine into a big data set
        bigdata <- rbindlist(sample.list)
        # add a row index column (by reference)
        index <- as.character(seq_len(nr))
        bigdata[, `:=`(rowid, index)]
        # set the key for binary searches
        setkey(bigdata, rowid)
        # split on this -
        dt_list <- lapply(index, function(i, j, x) x[i = J(i)], x = bigdata)
        # if you want to drop the `row id` column
        invisible(lapply(dt_list, function(x) set(x, j = "rowid", value = NULL)))
        # if you really don't want them to be data.tables run this line
        # invisible(lapply(dt_list, setattr,name = 'class', value =
        # c('data.frame')))
    })
    ################################
    ##    user  system elapsed    ##
    ##    0.08    0.00    0.08    ##
    ################################
    

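    How awesome is data.table

    A caveat for users of rbindlist

    rbindlist is fast because it does not perform the checks that do.call(rbind, ...) does. For example, it assumes that any factor columns have the same levels as in the first element of the list.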
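
As an illustration of the kind of check the base-R route performs, the sketch below binds two data.frames whose factor columns have different levels. It uses base R only and made-up data (`a`, `b`, and `combined` are illustrative names). It does not demonstrate rbindlist itself, so check the documentation of your installed data.table version for how rbindlist treats factor columns.

```r
# Hypothetical data: the same factor column with a different level in each frame.
a <- data.frame(id = 1L, grp = factor("low"))
b <- data.frame(id = 2L, grp = factor("high"))

# Base rbind reconciles the levels of the two factor columns.
combined <- do.call(rbind, list(a, b))

levels(combined$grp)        # contains the levels from both inputs
as.character(combined$grp)  # each row keeps its original label
```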

  • 46

    Here is my attempt with plyr, but I like G. Grothendieck's approach:

    library(plyr)
    alply(do.call("cbind",sample.list), 1, .fun=matrix,
            ncol=ncol(sample.list[[1]]), byrow=TRUE,
            dimnames=list(1:length(sample.list),
            names(sample.list[[1]])
          ))
    
