Using dplyr mutate to get the cumsum of unique values

The dummy dataset is:

data <- data.frame(
  id = c(1,1,2,2,3,4,5,6),
  value = c(10,10,20,20,10,30,40,50),
  other = c(1,2,3,4,5,6,7,8)
)

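The data is the output of a group_by(id) operation in a dplyr pipeline. Each id is associated with at most one value, while two different ids can share the same value. I need to add a new column holding the cumulative sum across ids, counting each id's value only once: cum_col = c(10,10,30,30,40,70,110,160). cumsum inside mutate computes the cumulative sum over all rows of the column and does not pick a single value per group. summarise does not help because I need to keep the other columns.

Is there a way to do this without using summarise and then join-ing the result back? Or, if this has been answered before, please point me to the link.

Edit: FYI, the real data has about 2 million rows and 100 columns.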
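
To make the problem concrete, here is a small sketch (not part of the original question) of what a plain cumsum produces on the dummy data, over the whole column and within each id. It uses base R's ave() as a stand-in for a grouped mutate, which is an assumption made to keep the snippet package-free; the names whole_column, within_group and wanted are illustrative only.

```r
data <- data.frame(
  id = c(1,1,2,2,3,4,5,6),
  value = c(10,10,20,20,10,30,40,50),
  other = c(1,2,3,4,5,6,7,8)
)

# cumsum over the whole column counts the value of a repeated id more than once
whole_column <- cumsum(data$value)

# cumsum within each id (a stand-in for a grouped mutate) restarts at every id
within_group <- ave(data$value, data$id, FUN = cumsum)

# the result the question asks for: each id's value counted exactly once
wanted <- c(10, 10, 30, 30, 40, 70, 110, 160)

print(data.frame(data, whole_column, within_group, wanted))
```

Neither variant matches the wanted column, which is why the answers below first reduce each id to a single value before accumulating.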

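3 Answers

One alternative is to nest the data frame by the id column, compute the cumulative sum over the first value of each nested group, and then unnest: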

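library(dplyr)
library(tidyr)
data %>% 
    group_by(id) %>% nest() %>% 
    ungroup() %>%   # needed with tidyr >= 1.0, where nest() keeps the grouping
    mutate(cum_col = cumsum(sapply(data, function(dat) dat$value[1]))) %>% 
    unnest(cols = data)   # newer tidyr asks for cols; column order may differ from the output shown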

# A tibble: 8 x 4
#     id cum_col value other
#  <dbl>   <dbl> <dbl> <dbl>
#1     1      10    10     1
#2     1      10    10     2
#3     2      30    20     3
#4     2      30    20     4
#5     3      40    10     5
#6     4      70    30     6
#7     5     110    40     7
#8     6     160    50     8

Comparison with summarise and join:

summarise_f <- function(data) data %>% 
    group_by(id) %>% 
    summarise(val = first(value)) %>%
    mutate(cum_col = cumsum(val)) %>%
    select(-val) %>%
    inner_join(data, by="id")

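# nest_f is kept exactly as benchmarked in the original answer; with tidyr >= 1.0
# it would likely need ungroup() after nest() and unnest(cols = data).
nest_f <- function(data) data %>% 
    group_by(id) %>% nest() %>% 
    mutate(cum_col = cumsum(sapply(data, function(dat) dat$value[1]))) %>% 
    unnest() 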

df <- bind_rows(rep(list(data), 100000))

microbenchmark::microbenchmark(summarise_f(df), nest_f(df))
#Unit: milliseconds
#            expr       min        lq     mean    median        uq      max neval
# summarise_f(df)  79.78891  89.65753 117.8480  93.56766  99.97694 277.3773   100
#      nest_f(df) 191.10597 208.07364 280.2466 225.65567 369.20202 524.5106   100

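summarise followed by join is actually faster, by roughly a factor of two at the median (about 94 ms versus 226 ms).

With a larger dataset: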

df <- bind_rows(rep(list(data), 1000000))
microbenchmark::microbenchmark(summarise_f(df), nest_f(df))
#Unit: milliseconds
#            expr       min        lq      mean    median       uq      max neval
# summarise_f(df)  819.5588  905.2136  993.4916  961.1797 1040.947 1480.391   100
#      nest_f(df) 1768.3060 1992.6753 2069.1454 2057.3091 2162.440 2501.715   100
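
For readers who want to see what the summarise-then-join route computes step by step, the following is a minimal package-free sketch. It is not the code that was benchmarked above: it replaces summarise with !duplicated() and the join with match(), and the names ids, first_value and per_id are hypothetical. It also assumes ids should be accumulated in order of first appearance, whereas summarise orders groups by id (the two coincide for the dummy data).

```r
data <- data.frame(
  id = c(1,1,2,2,3,4,5,6),
  value = c(10,10,20,20,10,30,40,50),
  other = c(1,2,3,4,5,6,7,8)
)

# "summarise" step: one row per id, holding the first value seen for that id
ids <- unique(data$id)
first_value <- data$value[!duplicated(data$id)]
per_id <- data.frame(id = ids, cum_col = cumsum(first_value))

# "join" step: look up each row's id in the per-id table
data$cum_col <- per_id$cum_col[match(data$id, per_id$id)]

print(data)
```

Because the lookup is by id, every row of an id receives the same cum_col even when the rows of that id are not adjacent.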

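Answered on 2024-05-05T12:36:19+08:00

Another approach is to create a helper column (cols) that keeps only the first value of each group and replaces the rest with 0, and then take the cumsum over the entire column.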

library(dplyr)
data %>%
  group_by(id) %>%
  mutate(cols = c(value[1], rep(0, n() -1))) %>%
  ungroup() %>%
  mutate(cum_col = cumsum(cols)) %>%
  select(-cols)


# A tibble: 8 x 4
#     id value other cum_col
#  <dbl> <dbl> <dbl>   <dbl>
#1     1    10     1      10
#2     1    10     2      10
#3     2    20     3      30
#4     2    20     4      30
#5     3    10     5      40
#6     4    30     6      70
#7     5    40     7     110
#8     6    50     8     160
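
The helper column is dropped at the end of the pipeline above, so it never shows up in the output. The sketch below makes it visible. It is a base R approximation of the same idea, using ave() in place of the grouped mutate; that substitution is an assumption made to avoid package dependencies.

```r
data <- data.frame(
  id = c(1,1,2,2,3,4,5,6),
  value = c(10,10,20,20,10,30,40,50),
  other = c(1,2,3,4,5,6,7,8)
)

# keep the first value of each id and zero out the remaining rows of that id
cols <- ave(data$value, data$id,
            FUN = function(v) c(v[1], rep(0, length(v) - 1)))

# a single cumsum over the zero-padded column now counts every id once
cum_col <- cumsum(cols)

print(data.frame(data, cols, cum_col))
```

Each id contributes its value on its first row and 0 on the others, so the running total only grows when a new id begins.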

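Answered on 2024-05-05T12:36:19+08:00

We can also use duplicated: !duplicated(id) is TRUE only on the first row of each id, so multiplying value by it zeroes out the repeats before cumsum is applied.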

library(dplyr)
data %>%
     mutate(cum_col = cumsum(value*!duplicated(id)))
#  id value other cum_col
#1  1    10     1      10
#2  1    10     2      10
#3  2    20     3      30
#4  2    20     4      30
#5  3    10     5      40
#6  4    30     6      70
#7  5    40     7     110
#8  6    50     8     160
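
One caveat is worth checking before using this on real data. The sketch below, in base R, first reproduces the result and then applies the same expression to a small hypothetical data frame (shuffled, not from the original post) in which an id reappears after another id. In that case the later row picks up the running total at its position rather than the total of its own id, so sorting by id first appears to be needed.

```r
data <- data.frame(
  id = c(1,1,2,2,3,4,5,6),
  value = c(10,10,20,20,10,30,40,50),
  other = c(1,2,3,4,5,6,7,8)
)

# same expression as in the answer, written without dplyr
cum_col <- cumsum(data$value * (!duplicated(data$id)))

# hypothetical data where id 1 reappears after id 2
shuffled <- data.frame(id = c(1, 2, 1), value = c(10, 20, 10))
unsorted_result <- cumsum(shuffled$value * (!duplicated(shuffled$id)))

# sorting by id first keeps all rows of an id together
sorted <- shuffled[order(shuffled$id), ]
sorted_result <- cumsum(sorted$value * (!duplicated(sorted$id)))

print(cum_col)
print(unsorted_result)
print(sorted_result)
```

On the dummy data the rows of each id are already adjacent, so the one-liner gives the expected column as is.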

Answered on 2024-05-05T12:36:19+08:00
