Summarizing with dplyr while keeping the same variable names

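I've found that data.table and dplyr give different results when I try to do the same thing. I'd like to use dplyr syntax but have it compute the way data.table does. The use case is adding subtotals to a table. To do that, I need to aggregate each variable while keeping the same variable name for the transformed version. data.table lets me aggregate a variable and keep the same name; when I then use that same variable in another aggregation, it still uses the untransformed version. dplyr, however, uses the transformed version.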

The documentation for summarise says:

# Note that with data frames, newly created summaries immediately
# overwrite existing variables
mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))

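This is basically the problem I'm running into, but I'm wondering whether there is a good workaround. One thing I found is to give the transformed variable a different name and rename it at the end, but that doesn't feel very clean to me. If there is a good way to do subtotals, that would be great too. I've looked around this site and haven't seen this exact situation. Any help would be greatly appreciated!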
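To make the overwrite and the rename workaround concrete, here is a minimal sketch on mtcars. It assumes dplyr is installed; the temporary name `disp_mean` is a made-up placeholder and does not appear in the original code.

```r
library(dplyr)

# disp is overwritten by its group mean first, so sd() only sees one
# value per group and returns NA
overwritten <- mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))

# Workaround: store the mean under a temporary name (hypothetical:
# disp_mean), so sd() still sees the original column, then rename
renamed <- mtcars %>%
  group_by(cyl) %>%
  summarise(disp_mean = mean(disp), sd = sd(disp)) %>%
  rename(disp = disp_mean)
```

Both results have the same column names; only the second one carries a usable standard deviation.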

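Here is a simple example, once with the data.table result and once with dplyr. I want to take this simple table and append a subtotal row that is the weighted average of the column of interest (Total).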

library(data.table)
library(dplyr)

dt <- data.table(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))
dt[, Count_Dist := Count/sum(Count)]
dt[, .(Count_Dist = sum(Count_Dist), Weighted_Total = sum(Count_Dist*Total))]

dt <- rbind(dt[, .(Group, Count_Dist, Total)],
      dt[, .(Group = "All", Count_Dist = sum(Count_Dist), Total = sum(Count_Dist*Total))])
setnames(dt, "Total", "Weighted_Avg_Total")

dt

df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))

df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  summarize(Count_Dist = sum(Count_Dist),
            Weighted_Total = sum(Count_Dist*Total))

df %>% 
  mutate(Count_Dist = Count/sum(Count)) %>%
  select(Group, Count_Dist, Total) %>% 
  rbind(df %>%
          mutate(Count_Dist = Count/sum(Count)) %>%
          summarize(Group = "All",
                    Count_Dist = sum(Count_Dist),
                    Total = sum(Count_Dist*Total))) %>% 
  rename(Weighted_Avg_Total = Total)
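The gap between the two results can be checked by hand in base R. This is a sketch of the arithmetic only, assuming (as the summarise documentation quoted above indicates) that the second expression in the dplyr summarize call sees the already-summed Count_Dist; the names `weighted` and `overwritten` are introduced here for illustration.

```r
Count <- c(1000, 1500, 1200, 2000, 5000)
Total <- c(50, 300, 600, 400, 1000)
Count_Dist <- Count / sum(Count)

# Untransformed Count_Dist: the weighted average the subtotal row should hold
weighted <- sum(Count_Dist * Total)

# Count_Dist already collapsed to its sum (about 1): this degenerates
# into a plain sum of Total
overwritten <- sum(sum(Count_Dist) * Total)

print(c(weighted = weighted, overwritten = overwritten))
```

The first value is the weighted average of about 656.07; the second is just the sum of Total, 2350.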

Thanks again for any help!

2 Answers

  • 3

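    One possible solution is to skip the mutate step and use transmute as the first mutate/select step, computing the needed variables directly from the original ones without creating an intermediate variable for the second mutate step: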

    df %>% 
      transmute(Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total) %>% 
      bind_rows(df %>%
                  summarize(Group = "All",
                            Count_Dist = sum(Count/sum(Count)),
                            Weighted_Avg_Total = sum((Count/sum(Count))*Total)))
    

    This gives:

      Group Count_Dist Weighted_Avg_Total
    1     A 0.09345794            50.0000
    2     B 0.14018692           300.0000
    3     C 0.11214953           600.0000
    4     D 0.18691589           400.0000
    5     E 0.46728972          1000.0000
    6   All 1.00000000           656.0748


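    Another possible solution is to change the order in which the new variables are computed in dplyr, and then use select to put the columns back in the order you originally wanted: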

    df %>% 
      mutate(Count_Dist = Count/sum(Count)) %>%
      select(Group, Count_Dist, Weighted_Avg_Total = Total) %>% 
      bind_rows(df %>%
                  mutate(Count_Dist = Count/sum(Count)) %>%
                  summarize(Group = "All",
                            Weighted_Avg_Total = sum(Count_Dist*Total),
                            Count_Dist = sum(Count_Dist)) %>% 
                  select(Group, Count_Dist, Weighted_Avg_Total))
    

    If you want to include the Count column as well, you can do the following (per my comment below):

    df %>% 
      transmute(Group = Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total, Count) %>% 
      bind_rows(df %>%
                  summarize(Group = "All",
                            Count_Dist = sum(Count/sum(Count)),
                            Weighted_Avg_Total = sum((Count/sum(Count))*Total),
                            Count = sum(Count)))
    
  • 1

    An alternative is to use mutate twice, so that Weighted_Total is also computed row by row beforehand, and then take the sum of that column in summarize.

    df %>%
      mutate(Count_Dist = Count/sum(Count)) %>%
      mutate(Weighted_Total = Count_Dist*Total) %>%
      summarize(Count_Dist = sum(Count_Dist),
                Weighted_Total = sum(Weighted_Total))
    Result:
      Count_Dist Weighted_Total
    1          1     656.074766
    

    And:

    df %>% 
          mutate(Count_Dist = Count/sum(Count)) %>%
          select(Group, Count_Dist, Total) %>% 
          rbind(df %>%
                  mutate(Count_Dist = Count/sum(Count)) %>%
                  mutate(Weighted_Total = Count_Dist*Total) %>%
                  summarize(Group = "All",
                            Count_Dist = sum(Count_Dist),
                            Total = sum(Weighted_Total))) %>% 
          rename(Weighted_Avg_Total = Total)
    
    Result:
    
          Group   Count_Dist Weighted_Avg_Total
        1     A 0.0934579439          50.000000
        2     B 0.1401869159         300.000000
        3     C 0.1121495327         600.000000
        4     D 0.1869158879         400.000000
        5     E 0.4672897196        1000.000000
        6   All 1.0000000000         656.074766
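
As a further variant not taken from the answers above, the subtotal row can be computed with base R's `weighted.mean()`, which never reads Count_Dist and so sidesteps the overwrite question entirely. This is a sketch under the assumption that dplyr is installed; `with_subtotal` is a name introduced here, and `stringsAsFactors = FALSE` is added so that Group is a character column on older R versions too.

```r
library(dplyr)

df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000),
                 stringsAsFactors = FALSE)

with_subtotal <- df %>%
  transmute(Group, Count_Dist = Count / sum(Count), Weighted_Avg_Total = Total) %>%
  bind_rows(df %>%
              summarize(Group = "All",
                        # shares computed from the raw counts sum to 1
                        Count_Dist = sum(Count / sum(Count)),
                        # weights are the raw counts, not Count_Dist
                        Weighted_Avg_Total = weighted.mean(Total, Count)))

with_subtotal
```

Because the weights are normalized inside `weighted.mean()`, passing Count gives the same subtotal as summing Count_Dist * Total.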
    
