Concatenate strings by group with dplyr for multiple columns [duplicate]

This question already has answers here:

Collapse all columns by an ID column [duplicate] (5 answers)

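Hi, I need to concatenate strings by group across multiple columns. I realize that versions of this question have been asked many times (see Aggregating by unique identifier and concatenating related values into a string), but they usually involve concatenating the values of a single column.

My dataset looks something like this: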

Sample  group   Gene1   Gene2   Gene3
A       1       a       NA      NA
A       2       b       NA      NA
B       1       NA      c       NA
C       1       a       NA      d
C       2       b       NA      e
C       3       c       NA      NA

I would like to turn it into a format with only one row per sample (the group column is optional):

Sample  group   Gene1   Gene2   Gene3
A       1,2     a,b     NA      NA
B       1       NA      c       NA
C       1,2,3   a,b,c   NA      d,e

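Since the number of genes can run into the thousands, I cannot simply list the columns I want to concatenate. I know that aggregate or dplyr can be used to do the grouping, but I cannot figure out how to apply this to multiple columns.

Thanks in advance!

Edit

Because my dataset is very large, with thousands of genes, I found dplyr to be too slow. I have been experimenting with data.table, and the code below also gets me what I want: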

setDT(df)[, lapply(.SD, function(x) paste(na.omit(x), collapse = ",")), by = Sample]

The output is now:

Sample group Gene1 Gene2 Gene3
1:      A   1,2   a,b            
2:      B     1           c      
3:      C 1,2,3 a,b,c         d,e

Thanks for your help!

2 Answers

1
For these purposes there are the summarise_all, summarise_at and summarise_if functions. Using summarise_all:
```
df %>%
  group_by(Sample) %>%
  summarise_all(funs(paste(na.omit(.), collapse = ",")))
```
# A tibble: 3 × 5
  Sample group Gene1 Gene2 Gene3
   <chr> <chr> <chr> <chr> <chr>
1      A   1,2   a,b
2      B     1           c
3      C 1,2,3 a,b,c         d,e
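
In more recent dplyr releases (1.0.0 and later), the scoped verbs and funs() are superseded or deprecated in favour of across(). The following is a minimal sketch of an equivalent call, not part of the original answer; the data frame df is reconstructed here from the sample table in the question so that the example is self-contained.

```r
library(dplyr)

# Example data reconstructed from the table in the question
df <- data.frame(
  Sample = c("A", "A", "B", "C", "C", "C"),
  group  = c(1, 2, 1, 1, 2, 3),
  Gene1  = c("a", "b", NA, "a", "b", "c"),
  Gene2  = c(NA, NA, "c", NA, NA, NA),
  Gene3  = c(NA, NA, NA, "d", "e", NA),
  stringsAsFactors = FALSE
)

# For every non-grouping column, drop NAs and paste the rest together.
# Groups in which a column is entirely NA end up as empty strings.
res <- df %>%
  group_by(Sample) %>%
  summarise(across(everything(), ~ paste(na.omit(.x), collapse = ",")))

print(res)
```

Inside a grouped summarise(), everything() does not select the grouping column, so Sample is left untouched and no gene columns need to be named explicitly.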
Answered 2024-04-26T07:50:31+08:00

With dplyr, you can try:

dft %>%
  group_by(Sample) %>%
  summarise_each(funs( toString(unique(.))))

This gives:

# A tibble: 3 × 5
  Sample   group   Gene1 Gene2    Gene3
   <chr>   <chr>   <chr> <chr>    <chr>
1      A    1, 2    a, b    NA       NA
2      B       1      NA     c       NA
3      C 1, 2, 3 a, b, c    NA d, e, NA

Edit: @Axeman has the right idea in using na.omit(.) to get rid of the NA values.
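
The desired output in the question keeps NA for sample and gene combinations that have no values, whereas pasting after na.omit() produces an empty string there. One possible way to combine unique values, drop NAs, and still return NA when nothing is left is sketched below. The helper name collapse_non_na is hypothetical, the across() call assumes dplyr 1.0.0 or later, and df is rebuilt from the question's sample table.

```r
library(dplyr)

# Hypothetical helper: collapse the unique non-NA values of x into one
# string, or return NA if the group has no non-NA values at all.
collapse_non_na <- function(x, sep = ",") {
  vals <- unique(x[!is.na(x)])
  if (length(vals) == 0) NA_character_ else paste(vals, collapse = sep)
}

# Example data reconstructed from the table in the question
df <- data.frame(
  Sample = c("A", "A", "B", "C", "C", "C"),
  group  = c(1, 2, 1, 1, 2, 3),
  Gene1  = c("a", "b", NA, "a", "b", "c"),
  Gene2  = c(NA, NA, "c", NA, NA, NA),
  Gene3  = c(NA, NA, NA, "d", "e", NA),
  stringsAsFactors = FALSE
)

# Apply the helper to every non-grouping column within each Sample
res <- df %>%
  group_by(Sample) %>%
  summarise(across(everything(), collapse_non_na))

print(res)
```

Note that unique() also merges repeated values within a group, which may or may not be wanted; dropping the unique() call keeps duplicates.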

Answered 2024-04-26T07:50:31+08:00
