Mean per group in a data.frame [duplicate]

I have a data.frame and I need to calculate the mean per group (i.e. per Month, below).

Name     Month  Rate1     Rate2
Aira       1      12        23
Aira       2      18        73
Aira       3      19        45
Ben        1      53        19
Ben        2      22        87
Ben        3      19        45
Cat        1      22        87
Cat        2      67        43
Cat        3      45        32

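My desired output is shown below, where the values of Rate1 and Rate2 are the group means. Please disregard the actual values; I made them up for this example.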

Name       Rate1       Rate2
Aira        23.21       12.2
Ben         45.23       43.9
Cat         33.22       32.2

8 Answers

  • 8

    This type of operation is exactly what aggregate was designed for:

    d <- read.table(text='Name     Month  Rate1     Rate2
    Aira       1      12        23
    Aira       2      18        73
    Aira       3      19        45
    Ben        1      53        19
    Ben        2      22        87
    Ben        3      19        45
    Cat        1      22        87
    Cat        2      67        43
    Cat        3      45        32', header=TRUE)
    
    aggregate(d[, 3:4], list(d$Name), mean)
    
      Group.1    Rate1    Rate2
    1    Aira 16.33333 47.00000
    2     Ben 31.33333 50.33333
    3     Cat 44.66667 54.00000
    

    Here we aggregate columns 3 and 4 of the data.frame d, grouping by d$Name and applying the mean function.


    Or, using the formula interface:

    aggregate(. ~ Name, d[-2], mean)
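
    When only some columns should be averaged, the formula interface can also name them explicitly with cbind() on the left-hand side. The following is a minimal sketch, assuming the same example data (rebuilt here as d so that the snippet is self-contained; the name res is only illustrative). Note that the formula method drops rows containing NA by default.

    ```r
    # Example data from the question, rebuilt so the snippet runs on its own
    d <- data.frame(
      Name  = rep(c("Aira", "Ben", "Cat"), each = 3),
      Month = rep(1:3, times = 3),
      Rate1 = c(12, 18, 19, 53, 22, 19, 22, 67, 45),
      Rate2 = c(23, 73, 45, 19, 87, 45, 87, 43, 32)
    )

    # Average only Rate1 and Rate2 within each Name
    res <- aggregate(cbind(Rate1, Rate2) ~ Name, data = d, FUN = mean)
    res
    ```

    Compared with aggregate(d[, 3:4], list(d$Name), mean), the grouping column keeps its own name (Name) instead of Group.1.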
    
  • 8

    Or use group_by and summarise_at from the dplyr package:

    library(dplyr)
    
    d %>%
      group_by(Name) %>%
      summarise_at(vars(-Month), mean, na.rm = TRUE)  # funs() is deprecated since dplyr 0.8.0
    
    # A tibble: 3 x 3
      Name  Rate1 Rate2
      <fct> <dbl> <dbl>
    1 Aira   16.3  47.0
    2 Ben    31.3  50.3
    3 Cat    44.7  54.0
    

    See ?summarise_at for the many ways to specify the variables to act on. Here, vars(-Month) means all variables except Month.
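
    In dplyr 1.0.0 and later, scoped verbs such as summarise_at are superseded by across(). A minimal sketch of the same grouping in that style, assuming dplyr >= 1.0.0 is installed and using a rebuilt copy of the example data (d and res are illustrative names):

    ```r
    library(dplyr)

    # Example data from the question, rebuilt so the snippet runs on its own
    d <- data.frame(
      Name  = rep(c("Aira", "Ben", "Cat"), each = 3),
      Month = rep(1:3, times = 3),
      Rate1 = c(12, 18, 19, 53, 22, 19, 22, 67, 45),
      Rate2 = c(23, 73, 45, 19, 87, 45, 87, 43, 32)
    )

    # Mean of Rate1 and Rate2 within each Name, ignoring missing values
    res <- d %>%
      group_by(Name) %>%
      summarise(across(c(Rate1, Rate2), ~ mean(.x, na.rm = TRUE)))
    res
    ```

    Selecting the columns by name rather than with -Month keeps the result stable if further columns are added to the data later.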

  • 4

    You can also use the plyr package, which is somewhat more versatile:

    library(plyr)
    
    ddply(d, .(Name), summarize,  Rate1=mean(Rate1), Rate2=mean(Rate2))
    
      Name    Rate1    Rate2
    1 Aira 16.33333 47.00000
    2  Ben 31.33333 50.33333
    3  Cat 44.66667 54.00000
    
  • 32

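    A third great alternative is the data.table package, whose objects also inherit from data.frame, but which computes operations like the one you are looking for much faster.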

    library(data.table)
    mydt <- structure(list(Name = c("Aira", "Aira", "Aira", "Ben", "Ben", "Ben", "Cat", "Cat", "Cat"), Month = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Rate1 = c(15.6396600443877, 2.15649279424609, 6.24692918928743, 2.37658797276116, 34.7500663272292, 3.28750138697048, 29.3265553981065, 17.9821839334431, 10.8639802575958), Rate2 = c(17.1680489538369, 5.84231656330206, 8.54330866437461, 5.88415184986176, 3.02064294862551, 17.2053351400752, 16.9552950199166, 2.56058000170089, 15.7496228048122)), .Names = c("Name", "Month", "Rate1", "Rate2"), row.names = c(NA, -9L), class = c("data.table", "data.frame"))
    

    Now, to take the mean of Rate1 and Rate2 over all 3 months for each person (Name), first decide which columns you want to average:

    colstoavg <- names(mydt)[3:4]
    

    Now we use lapply to take the mean over the columns we want to average (colstoavg):

    mydt.mean <- mydt[,lapply(.SD,mean,na.rm=TRUE),by=Name,.SDcols=colstoavg]
    
     mydt.mean
       Name     Rate1     Rate2
    1: Aira  8.014361 10.517891
    2:  Ben 13.471385  8.703377
    3:  Cat 19.390907 11.755166
    
  • 13

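    I describe two ways to do this, one based on data.table and the other based on the reshape2 package. The data.table way already has an answer, but I have tried to make it cleaner and more detailed.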

    The data looks like this:

    d <- structure(list(Name = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 
    3L, 3L), .Label = c("Aira", "Ben", "Cat"), class = "factor"), 
        Month = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Rate1 = c(12L, 
        18L, 19L, 53L, 22L, 19L, 22L, 67L, 45L), Rate2 = c(23L, 73L, 
        45L, 19L, 87L, 45L, 87L, 43L, 32L)), .Names = c("Name", "Month", 
    "Rate1", "Rate2"), class = "data.frame", row.names = c(NA, -9L
    ))
    head(d)
      Name Month Rate1 Rate2
    1 Aira     1    12    23
    2 Aira     2    18    73
    3 Aira     3    19    45
    4  Ben     1    53    19
    5  Ben     2    22    87
    6  Ben     3    19    45
    
    
    library("reshape2")
    mym <- melt(d, id = c("Name"))
    res <- dcast(mym, Name ~ variable, mean)
    res
    #Name Month    Rate1    Rate2
    #1 Aira     2 16.33333 47.00000
    #2  Ben     2 31.33333 50.33333
    #3  Cat     2 44.66667 54.00000
    

    Using data.table:

    # At first, I convert the data.frame to data.table and then I group it 
    setDT(d)
    d[, .(Rate1 = mean(Rate1), Rate2 = mean(Rate2)), by = .(Name)]
    #   Name    Rate1    Rate2
    #1: Aira 16.33333 47.00000
    #2:  Ben 31.33333 50.33333
    #3:  Cat 44.66667 54.00000
    

    There is another way that avoids writing out many arguments for j in data.table, namely using .SD:

    d[, lapply(.SD, mean), by = .(Name)]
    #   Name Month    Rate1    Rate2
    #1: Aira     2 16.33333 47.00000
    #2:  Ben     2 31.33333 50.33333
    #3:  Cat     2 44.66667 54.00000
    

    If we only want Rate1 and Rate2, we can use .SDcols as follows:

    d[, lapply(.SD, mean), by = .(Name), .SDcols = 3:4]
    #  Name    Rate1    Rate2
    #1: Aira 16.33333 47.00000
    #2:  Ben 31.33333 50.33333
    #3:  Cat 44.66667 54.00000
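
    The positional index 3:4 breaks silently if the column order changes. As a hedged variation, .SDcols also accepts column names, so the columns can be picked by a name pattern instead. The sketch below assumes the rate columns all start with "Rate"; dt, rate.cols and res are illustrative names.

    ```r
    library(data.table)

    # Example data from the question, built directly as a data.table
    dt <- data.table(
      Name  = rep(c("Aira", "Ben", "Cat"), each = 3),
      Month = rep(1:3, times = 3),
      Rate1 = c(12, 18, 19, 53, 22, 19, 22, 67, 45),
      Rate2 = c(23, 73, 45, 19, 87, 45, 87, 43, 32)
    )

    # Pick the columns to average by name rather than by position
    rate.cols <- grep("^Rate", names(dt), value = TRUE)

    # Group by Name and average only the selected columns
    res <- dt[, lapply(.SD, mean, na.rm = TRUE), by = Name, .SDcols = rate.cols]
    res
    ```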
    
  • 5

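    Here are a variety of ways to do this in base R, including an alternative aggregate approach. The examples below return means per month, which I think is what you asked for. However, the same approaches can be used to return means per person: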

    Using ave:

    my.data <- read.table(text = '
         Name     Month  Rate1     Rate2
         Aira       1      12        23
         Aira       2      18        73
         Aira       3      19        45
         Ben        1      53        19
         Ben        2      22        87
         Ben        3      19        45
         Cat        1      22        87
         Cat        2      67        43
         Cat        3      45        32
    ', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')
    
    Rate1.mean <- with(my.data, ave(Rate1, Month, FUN = function(x) mean(x, na.rm = TRUE)))
    Rate2.mean <- with(my.data, ave(Rate2, Month, FUN = function(x) mean(x, na.rm = TRUE)))
    
    my.data <- data.frame(my.data, Rate1.mean, Rate2.mean)
    my.data
    

    Using by:

    my.data <- read.table(text = '
         Name     Month  Rate1     Rate2
         Aira       1      12        23
         Aira       2      18        73
         Aira       3      19        45
         Ben        1      53        19
         Ben        2      22        87
         Ben        3      19        45
         Cat        1      22        87
         Cat        2      67        43
         Cat        3      45        32
    ', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')
    
    by.month <- as.data.frame(do.call("rbind", by(my.data, my.data$Month, FUN = function(x) colMeans(x[,3:4]))))
    colnames(by.month) <- c('Rate1.mean', 'Rate2.mean')
    by.month <- cbind(Month = rownames(by.month), by.month)
    
    my.data <- merge(my.data, by.month, by = 'Month')
    my.data
    

    Using lapply and split:

    my.data <- read.table(text = '
         Name     Month  Rate1     Rate2
         Aira       1      12        23
         Aira       2      18        73
         Aira       3      19        45
         Ben        1      53        19
         Ben        2      22        87
         Ben        3      19        45
         Cat        1      22        87
         Cat        2      67        43
         Cat        3      45        32
    ', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')
    
    ly.mean <- lapply(split(my.data, my.data$Month), function(x) c(Mean = colMeans(x[,3:4])))
    ly.mean <- as.data.frame(do.call("rbind", ly.mean))
    ly.mean <- cbind(Month = rownames(ly.mean), ly.mean)
    
    my.data <- merge(my.data, ly.mean, by = 'Month')
    my.data
    

    Using sapply and split:

    my.data <- read.table(text = '
         Name     Month  Rate1     Rate2
         Aira       1      12        23
         Aira       2      18        73
         Aira       3      19        45
         Ben        1      53        19
         Ben        2      22        87
         Ben        3      19        45
         Cat        1      22        87
         Cat        2      67        43
         Cat        3      45        32
    ', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')
    my.data
    
    sy.mean <- t(sapply(split(my.data, my.data$Month), function(x) colMeans(x[,3:4])))
    colnames(sy.mean) <- c('Rate1.mean', 'Rate2.mean')
    sy.mean <- data.frame(Month = rownames(sy.mean), sy.mean, stringsAsFactors = FALSE)
    my.data <- merge(my.data, sy.mean, by = 'Month')
    my.data
    

    Using aggregate:

    my.data <- read.table(text = '
         Name     Month  Rate1     Rate2
         Aira       1      12        23
         Aira       2      18        73
         Aira       3      19        45
         Ben        1      53        19
         Ben        2      22        87
         Ben        3      19        45
         Cat        1      22        87
         Cat        2      67        43
         Cat        3      45        32
    ', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')
    
    my.summary <- with(my.data, aggregate(list(Rate1, Rate2), by = list(Month), 
                       FUN = function(x) { mon.mean = mean(x, na.rm = TRUE) } ))
    
    my.summary <- do.call(data.frame, my.summary)
    colnames(my.summary) <- c('Month', 'Rate1.mean', 'Rate2.mean')
    my.summary
    
    my.data <- merge(my.data, my.summary, by = 'Month')
    my.data
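
    To return the means per person rather than per month, the grouping variable can simply be swapped. A minimal sketch using split and sapply, assuming the same example data (per.person is an illustrative name):

    ```r
    # Example data from the question, rebuilt so the snippet runs on its own
    my.data <- data.frame(
      Name  = rep(c("Aira", "Ben", "Cat"), each = 3),
      Month = rep(1:3, times = 3),
      Rate1 = c(12, 18, 19, 53, 22, 19, 22, 67, 45),
      Rate2 = c(23, 73, 45, 19, 87, 45, 87, 43, 32),
      stringsAsFactors = FALSE
    )

    # Split the rate columns by Name, take column means per piece,
    # then transpose so that each person becomes one row
    per.person <- t(sapply(split(my.data[, c("Rate1", "Rate2")], my.data$Name), colMeans))

    # Turn the row names into a regular Name column
    per.person <- data.frame(Name = rownames(per.person), per.person,
                             row.names = NULL, stringsAsFactors = FALSE)
    per.person
    ```

    This yields one row per person, which matches the layout of the desired output in the question.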
    
  • 40

    You could also use the generic functions cbind() and lm() without an intercept:

    cbind(lm(d$Rate1~-1+d$Name)$coef,lm(d$Rate2~-1+d$Name)$coef)
    >               [,1]     [,2]
    >d$NameAira 16.33333 47.00000
    >d$NameBen  31.33333 50.33333
    >d$NameCat  44.66667 54.00000
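
    This works because, in a model with no intercept and a single categorical predictor, the least-squares coefficient for each level is the mean of the response within that level. The sketch below checks this against tapply on the example data; it also uses a matrix response so that a single lm() call covers both columns. The names fit, group.means and both are illustrative.

    ```r
    # Example data from the question, rebuilt so the snippet runs on its own
    d <- data.frame(
      Name  = rep(c("Aira", "Ben", "Cat"), each = 3),
      Month = rep(1:3, times = 3),
      Rate1 = c(12, 18, 19, 53, 22, 19, 22, 67, 45),
      Rate2 = c(23, 73, 45, 19, 87, 45, 87, 43, 32)
    )

    # No-intercept fit: one coefficient per level of Name
    fit <- lm(Rate1 ~ 0 + Name, data = d)

    # Plain group means for comparison
    group.means <- tapply(d$Rate1, d$Name, mean)
    all.equal(unname(coef(fit)), as.vector(group.means))

    # A matrix response fits both columns at once; coef() is then a 3 x 2 matrix
    both <- coef(lm(cbind(Rate1, Rate2) ~ 0 + Name, data = d))
    both
    ```

    Fitting a linear model is heavier than a direct grouped mean, so this is mainly useful as an illustration of what the coefficients mean.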
    
  • 184

    You can also accomplish this with the sqldf package, as shown below:

    library(sqldf)
    
    x <- read.table(text='Name     Month  Rate1     Rate2
    Aira       1      12        23
    Aira       2      18        73
    Aira       3      19        45
    Ben        1      53        19
    Ben        2      22        87
    Ben        3      19        45
    Cat        1      22        87
    Cat        2      67        43
    Cat        3      45        32', header=TRUE)
    
    sqldf("
    select 
      Name
      ,avg(Rate1) as Rate1_float
      ,avg(Rate2) as Rate2_float
      ,avg(Rate1) as Rate1
      ,avg(Rate2) as Rate2
    from x
    group by 
      Name
    ")
    
    #  Name Rate1_float Rate2_float Rate1 Rate2
    #1 Aira    16.33333    47.00000    16    47
    #2  Ben    31.33333    50.33333    31    50
    #3  Cat    44.66667    54.00000    44    54
    

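    I am a recent convert to dplyr, as shown in other answers, but sqldf is nice because most data analysts, data scientists and developers are at least somewhat fluent in SQL. For that reason, I think it tends to produce more universally readable code than dplyr or the other solutions presented above.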

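    UPDATE: In response to the comment below, I tried to update the code as shown above. However, the behavior was not what I expected. It seems that the column definition (i.e. int vs float) is only carried through when the column alias matches the original column name. When a new name is specified, the aggregated column is returned without rounding.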
