
Subset sums based on a second vector

I have two vectors:

a <- c(1,1,2,3,4,4,4,4,5,6)
b <- c(T,F,T,F,T,T,F,F,F,T)

I want a vector that tells me how many TRUE values in b there are for each unique value in a (the second column below):

     [,1] [,2]
[1,]    1    1
[2,]    2    1
[3,]    3    0
[4,]    4    2
[5,]    5    0
[6,]    6    1

The best I have come up with so far uses sapply:

sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b)

This works, but it is rather slow for larger vectors. (I tried a few subsetting variants.)

a <- sample(1:1000, 1e5, replace = TRUE)
b <- sample(c(T,F), 1e5, replace = TRUE)

microbenchmark::microbenchmark(
    subset = sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b)
    , iN = sapply(unique(a), FUN = function(uniqueA, a, b) sum(a %in% uniqueA & b), a = a, b = b)
    , equal = sapply(unique(a), FUN = function(uniqueA, a, b) sum(a == uniqueA & b), a = a, b = b)
    , times = 5
)

Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval
 subset  389.1995  390.6002  413.6969  393.0396  445.6553  449.9897     5
     iN 2746.8407 2798.0462 2797.3155 2806.9477 2814.6317 2820.1110     5
  equal 1080.3430 1089.2507 1111.0267 1096.8082 1135.1957 1153.5358     5

Does anyone know how to do this faster?

3 Answers

  • 1

    You can use aggregate:

    aggregate(b, list(a), sum)
    

    For the fastest performance, I recommend data.table. Setup takes longer, but performance should be very good on large data.

    library(data.table)
    dt <- data.table(a = a, b = b)
    dt[,sum(b), by = a]
    

    Speed test comparing (1) aggregate, (2) sapply, (3) data.table, and (4) tapply:

    library(data.table)

    a <- sample(1:1000, 1e5, replace = TRUE)
    b <- sample(c(TRUE, FALSE), 1e5, replace = TRUE)

    # Building the data.table happens inside the function, so setup cost is timed too
    summarize_dt <- function(a, b) {
      dt <- data.table(a = a, b = b)
      dt[, sum(b), by = a]
    }

    microbenchmark::microbenchmark(
      aggregate = aggregate(b, list(a), sum),
      sapply = sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b),
      datatable = summarize_dt(a, b),
      tapply = tapply(b, a, sum)
    )
    
          #expr        min         lq       mean     median         uq        max neval
     #aggregate 130.995347 133.672041 141.404597 135.301762 137.199151 213.730345   100
        #sapply 335.344866 357.387474 394.432339 411.994214 425.604144 486.548520   100
     #datatable   1.540011   1.914712   2.430220   2.027578   2.239999   5.297593   100
        #tapply   3.075646   3.627395   4.719595   4.089434   5.934675   8.758332   100
    

    It looks like data.table is the fastest, with tapply close behind.
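
Before comparing timings, it can help to confirm that the approaches agree on the small example from the question. The following is a minimal sketch using base R only; the variable names (by_sapply, by_tapply, by_aggregate, ord) are illustrative and not from the thread. Note that tapply and aggregate return groups in sorted order, while the sapply version follows the order of unique(a).

```r
a <- c(1, 1, 2, 3, 4, 4, 4, 4, 5, 6)
b <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

# Original approach: one subset-and-sum per unique value, in unique(a) order
by_sapply <- sapply(unique(a), function(u) sum(b[a == u]))

# Grouped sums; both return the groups sorted by value
by_tapply <- tapply(b, a, sum)
by_aggregate <- aggregate(b, list(a), sum)  # columns: Group.1, x

# Reorder the sapply result to sorted group order before comparing
ord <- order(unique(a))
stopifnot(all(by_sapply[ord] == unname(by_tapply)))
stopifnot(all(by_aggregate$x == unname(by_tapply)))
```

If both checks pass silently, the three base R variants produce the same counts, so the benchmark compares like with like.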

  • 1

    This is possible in base R using table:

    t <- table(a[b])
    z <- as.numeric(names(t))
    rbind(unname(cbind(z, t)), cbind(setdiff(unique(a),z),0))
    
    #      [,1] [,2]
    # [1,]    1    1
    # [2,]    2    1
    # [3,]    4    2
    # [4,]    6    1
    # [5,]    3    0
    # [6,]    5    0
    

    If you only want the values of a that have a nonzero number of TRUEs, the table call on its own is enough.
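
A possible variation on the table idea, not taken from the answer above, is to cross-tabulate a against b and keep the "TRUE" column. This keeps the zero counts and the sorted order without the rbind/setdiff step. It is only a sketch: counts and result are illustrative names, and it assumes b contains at least one TRUE (otherwise the "TRUE" column does not exist).

```r
a <- c(1, 1, 2, 3, 4, 4, 4, 4, 5, 6)
b <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

# Two-way table: rows are the unique values of a, columns are "FALSE" and "TRUE".
# Assumption: b has at least one TRUE; otherwise wrap b in
# factor(b, levels = c(FALSE, TRUE)) so that the "TRUE" column always exists.
counts <- table(a, b)[, "TRUE"]

# Two-column matrix: unique value, number of TRUEs (zeros included)
result <- cbind(as.numeric(names(counts)), unname(counts))
result
```

Unlike table(a[b]), values of a with no TRUE (here 3 and 5) appear with a count of 0.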

  • 3

    Or we can use tidyverse:

    library(tidyverse)
    tibble(a, b) %>% 
           group_by(a) %>%
           summarise(b = sum(b))
    

    A base R option would be:

    rowsum(+b, a)
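
To spell out what the rowsum call does on the question's small example: the unary plus converts the logical vector b to integers (TRUE becomes 1, FALSE becomes 0), which matters because rowsum, as far as I know, only accepts numeric input. The sketch below uses an illustrative variable name (totals) that is not from the answer.

```r
a <- c(1, 1, 2, 3, 4, 4, 4, 4, 5, 6)
b <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

# +b coerces logical to integer so the values can be summed per group
totals <- rowsum(+b, a)

# totals is a one-column matrix; the row names are the sorted unique values of a
totals
```

The result has one row per unique value of a, zeros included, so it matches the second column the question asks for.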
    
