An efficient way to summarize multiple times with dplyr

A quick example:

set.seed(123)
library("dplyr")
# tibble() replaces the deprecated data_frame(); TRUE is safer than T
df <- tibble(client = sample(letters, 200, replace = TRUE),
             content = sample(LETTERS, 200, replace = TRUE))

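I observe clients interacting with content. I want to know how many distinct pieces of content each client has used.

I do the following to get what I want: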

df %>%
  group_by(client, content) %>%
  summarize(n=n()) %>%
  summarize(n_content=n())

# output
   client n_content
    (chr)     (int)
1       a         3
2       b         4
3       c         5
..    ...       ...

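The point of the first summarize is to get exactly one row per client/content combination (since a client may use the same content several times). The output of the first n() is therefore useless to me, which makes me think there must be a more efficient and elegant solution.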
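To make the role of the two summarize calls concrete, here is a minimal sketch on a hypothetical four-row data set (`toy` is not part of the original example). It relies on the fact that each summarize removes the last grouping level, so the second call counts rows per client. The `.groups` argument is an assumption about the installed version: it exists in dplyr 1.0.0 and later and only makes the default behavior explicit.

```r
library(dplyr)

# Hypothetical toy data: client "a" uses content "X" twice and "Y" once.
toy <- tibble(
  client  = c("a", "a", "a", "b"),
  content = c("X", "X", "Y", "X")
)

# Step 1: one row per client/content pair.
# The result is still grouped by client (last grouping level dropped).
pairs <- toy %>%
  group_by(client, content) %>%
  summarize(n = n(), .groups = "drop_last")

# Step 2: count the remaining rows per client,
# which is the number of distinct content values.
result <- pairs %>%
  summarize(n_content = n())

print(result)
```

The column `n` produced in step 1 is never used; only the number of rows per client matters.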

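Is there a more efficient way to do this? Ideally I am looking for a dplyr-compatible solution, but base R or other packages are fine. I am not interested in solutions using data.table.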

2 Answers

Alternatively, use group_by with n_distinct:

df %>%
  group_by(client) %>%
  summarize(n_content=n_distinct(content))

This is slightly faster than the distinct() + count() approach:

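library("microbenchmark")

# f1: count distinct content values within each client group
f1 <- function() df %>%
  group_by(client) %>%
  summarize(n_content = n_distinct(content))

# f2: drop duplicate rows first, then count rows per client
f2 <- function() df %>%
  distinct() %>%
  count(client)

microbenchmark(f1(), f2())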

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
 f1() 1.884358 1.996009 2.307482 2.123363 2.598729 3.318076   100  a 
 f2() 2.434831 2.532641 3.031416 2.817830 3.360372 5.462430   100   b

Answered on 2024-05-18T22:59:08+08:00

You can do this:

df %>%
  distinct() %>%
  count(client)

Source: local data frame [26 x 2]

   client     n
    (chr) (int)
1       a     3
2       b     4
3       c     5
4       d    10
5       e     5
6       f     6
7       g     8
8       h     5
9       i     7
10      j    10
..    ...   ...
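One caveat worth illustrating: a bare distinct() removes duplicates across all columns, so the approach above works because df only contains client and content. The sketch below uses a hypothetical data set with an extra `day` column (not part of the original question) to show the difference, and names the columns explicitly in distinct() as a possible safeguard.

```r
library(dplyr)

# Hypothetical toy data with an extra column beyond client/content.
toy <- tibble(
  client  = c("a", "a", "a", "b"),
  content = c("X", "X", "Y", "X"),
  day     = c(1, 2, 1, 1)
)

# Bare distinct() keeps all four rows, because `day` makes each row unique,
# so client "a" is counted three times.
over_counted <- toy %>%
  distinct() %>%
  count(client)

# Naming the columns restricts de-duplication to client/content pairs.
counted <- toy %>%
  distinct(client, content) %>%
  count(client)

print(over_counted)
print(counted)
```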

Answered on 2024-05-18T22:59:08+08:00
