
Summarising counts and proportions with ddply


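I'm having some trouble with the ddply function from the plyr package. I'm trying to summarise the following data with the count and the proportion within each group. Here is my data: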

structure(list(X5employf = structure(c(1L, 3L, 1L, 1L, 1L, 3L, 
1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 1L, 
3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 
3L, 3L, 1L), .Label = c("increase", "decrease", "same"), class = "factor"), 
    X5employff = structure(c(2L, 6L, NA, 2L, 4L, 6L, 5L, 2L, 
    2L, 8L, 2L, 2L, 2L, 7L, 7L, 8L, 11L, 7L, 2L, 8L, 8L, 11L, 
    7L, 6L, 2L, 5L, 2L, 8L, 7L, 7L, 7L, 8L, 6L, 7L, 5L, 5L, 7L, 
    2L, 6L, 7L, 2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 2L, 5L, 2L, 2L, 
    2L, 5L, 12L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 2L, 5L, 2L, 
    13L, 9L, 9L, 9L, 7L, 8L, 5L), .Label = c("", "1", "1  and 8", 
    "2", "3", "4", "5", "6", "6 and 7", "6 and 7 ", "7", "8", 
    "1 and 8"), class = "factor")), .Names = c("X5employf", "X5employff"
), row.names = c(NA, 73L), class = "data.frame")

Here is my ddply call:

ddply(kano_final, .(X5employf, X5employff), summarise, n=length(X5employff), prop=(n/sum(n))*100)

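This correctly gives me the count for each value of X5employff, but the proportion seems to be computed within each row rather than within each level of the factor X5employf, as shown below: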

X5employf X5employff  n prop
1   increase          1 26  100
2   increase          2  1  100
3   increase          3 15  100
4   increase    1 and 8  1  100
5   increase       <NA>  1  100
6   decrease          4  1  100
7   decrease          5  5  100
8   decrease          6  2  100
9   decrease          7  1  100
10  decrease          8  1  100
11      same          4  4  100
12      same          5  6  100
13      same          6  5  100
14      same    6 and 7  3  100
15      same          7  1  100

When I compute the proportions within each group by hand, I get:

X5employf X5employff  n prop
1   increase          1 26  59.09
2   increase          2  1  2.27
3   increase          3 15  34.09
4   increase    1 and 8  1  2.27
5   increase       <NA>  1  2.27
6   decrease          4  1  10.00
7   decrease          5  5  50.00
8   decrease          6  2  20.00
9   decrease          7  1  10.00
10  decrease          8  1  10.00
11      same          4  4  21.05
12      same          5  6  31.57
13      same          6  5  26.31
14      same    6 and 7  3  15.78
15      same          7  1  5.26

As you can see, the proportions within each level of the factor X5employf sum to 100.

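I know this is probably very simple, but despite reading various similar posts I can't work it out. Can anyone help with this problem, and with my understanding of how the summarise function works?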

Many thanks

Marty

3 Answers

  • 6

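    You can't do this in a single ddply call grouped on both variables, because each summarise call receives only the subset of the data for one specific combination of your grouping variables. At that lowest level you have no access to the intermediate-level sum(n). Instead, do it in two steps: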

    kano_final <- ddply(kano_final, .(X5employf), transform,
                        sum.n = length(X5employf))
    
    ddply(kano_final, .(X5employf, X5employff), summarise, 
          n = length(X5employff), prop = n / sum.n[1] * 100)
    

    Edit: using a single ddply call together with table, as you hinted:

    ddply(kano_final, .(X5employf), summarise,
          n          = Filter(function(x) x > 0, table(X5employff, useNA = "ifany")),
          prop       = 100* prop.table(n),
          X5employff = names(n))
    
  • 0

    I'm adding a dplyr example here, which does this easily in one step with short code and easy-to-read syntax.

    Here d is your data.frame:

    library(dplyr)
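    d %>%
      dplyr::group_by(X5employf, X5employff) %>%
      # dplyr:: prefixes avoid clashes with plyr's summarise/mutate if plyr is also loaded
      dplyr::summarise(n = n()) %>%   # drops the last grouping level, leaving X5employf
      dplyr::mutate(ngr = sum(n)) %>%
      dplyr::mutate(prop = n / ngr * 100)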
    

    which results in

    Source: local data frame [15 x 5]
    Groups: X5employf
    
       X5employf X5employff  n ngr      prop
    1   increase          1 26  44 59.090909
    2   increase          2  1  44  2.272727
    3   increase          3 15  44 34.090909
    4   increase    1 and 8  1  44  2.272727
    5   increase         NA  1  44  2.272727
    6   decrease          4  1  10 10.000000
    7   decrease          5  5  10 50.000000
    8   decrease          6  2  10 20.000000
    9   decrease          7  1  10 10.000000
    10  decrease          8  1  10 10.000000
    11      same          4  4  19 21.052632
    12      same          5  6  19 31.578947
    13      same          6  5  19 26.315789
    14      same    6 and 7  3  19 15.789474
    15      same          7  1  19  5.263158
    
  • 1

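    What you evidently want is the proportion of each X5employff value within each value of X5employf. However, you never tell ddply that X5employf and X5employff play different roles; to ddply they are simply two variables used to split the data. In addition, because each data row is one observation (a count of 1 per row), the length of each (X5employf, X5employff) combination equals the sum over that same combination, so n / sum(n) is always 1.

    The simplest "plyr way" I can think of to solve the problem is as follows: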
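
    To make the point above concrete, here is a minimal base R sketch that mimics the split ddply performs. The data frame toy and its columns grp and sub are made up for illustration; they are not the asker's data, and split/sapply only approximate what plyr does internally.

    ```r
    # Toy data (hypothetical): one row per observation.
    toy <- data.frame(
      grp = c("a", "a", "a", "b"),
      sub = c("x", "x", "y", "x")
    )

    # Split by every grp/sub combination, roughly as ddply does with .(grp, sub).
    pieces <- split(toy, list(toy$grp, toy$sub), drop = TRUE)

    # Inside one piece, n is a single number, so sum(n) is that same number.
    per_piece <- sapply(pieces, function(piece) {
      n <- nrow(piece)
      n / sum(n) * 100
    })
    print(per_piece)  # every entry is 100

    # Dividing by the size of the parent grp instead gives within-group percentages.
    grp_total <- table(toy$grp)
    counts <- as.data.frame(table(grp = toy$grp, sub = toy$sub), responseName = "n")
    counts <- counts[counts$n > 0, ]
    counts$prop <- counts$n / as.vector(grp_total[as.character(counts$grp)]) * 100
    print(counts)
    ```

    The denominator has to come from a coarser grouping than the one used for the counts, which is what the solutions in this thread arrange in different ways.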

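    # .drop = FALSE (note the leading dot) keeps empty combinations, so every
    # X5employf level contributes the same number of rows to result
    result <- ddply(kano_final, .(X5employf, X5employff), summarise, n=length(X5employff), .drop=FALSE)
    n <- result$n
    n2 <- ddply(kano_final, .(X5employf), summarise, n=length(X5employff))$n
    # repeat each group total once per row of that group instead of hard-coding 13
    result <- data.frame(result, prop=n/rep(n2, each=nrow(result)/length(n2))*100)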
    

    You can also use good old xtabs:

    a <- xtabs(~X5employf + X5employff, kano_final)
    b <- xtabs(~X5employf, kano_final)
    a/matrix(b, nrow=3, ncol=ncol(a))
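
    A related base R sketch, offered as an assumption-laden variant rather than as part of the answer above: table with useNA = "ifany" keeps missing values as their own column, and prop.table with margin = 1 divides each row by its own total. By default xtabs typically leaves out rows containing NA, so the shares computed from it may not add up to 1 within a row. The data frame toy and its columns grp and sub are hypothetical.

    ```r
    # Toy data (hypothetical) with one missing value in sub.
    toy <- data.frame(
      grp = c("a", "a", "a", "b", "b"),
      sub = c("x", "x", NA, "x", "y")
    )

    # Cross-tabulate, keeping NA as a separate column.
    tab <- table(grp = toy$grp, sub = toy$sub, useNA = "ifany")

    # Row-wise percentages: each row of pct sums to 100.
    pct <- prop.table(tab, margin = 1) * 100
    print(round(pct, 2))
    ```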
    
