
join (plyr library), strange error


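I'm trying to write a function in R that calculates the Gini coefficient (a measure of income inequality) from given incomes and population shares. This is what I have so far: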

incomes <- c(1175,1520,1865,2210,2555) # incomes
population <- rep(1/5,5)*100           # population shares (5 times 1/5)

income <- incomes*population/sum(incomes*population) # income * frequency / total income
data <- as.data.frame(cbind(incomes,income,population/100))
names(data) <- c("incomes","income","population")

data <- data[order(as.numeric(data$incomes)),] # sort by income, ascending

for (i in 1:length(income)){
    data$richer[i] <- 1-sum(data$population[1:i])
}
data$score <- data$income * (data$population + 2 * data$richer)
gini <- round(1-sum(data$score),4) # gini
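
For reference, the steps above can be wrapped into one function. This is only a minimal sketch: the name `gini_from_shares` is hypothetical, it uses `cumsum` in place of the loop, and it assumes `population` holds non-negative weights that can be normalised to shares.

```r
# Hypothetical helper: Gini coefficient from incomes and population shares,
# following the same formula as the script above.
gini_from_shares <- function(incomes, population) {
  stopifnot(length(incomes) == length(population))
  ord <- order(incomes)                              # sort by income, ascending
  incomes <- incomes[ord]
  population <- population[ord] / sum(population)    # normalise shares to sum to 1
  income <- incomes * population / sum(incomes * population)  # income shares
  richer <- 1 - cumsum(population)                   # share of people richer than each group
  1 - sum(income * (population + 2 * richer))
}

g <- gini_from_shares(c(1175, 1520, 1865, 2210, 2555), rep(1/5, 5))
round(g, 4)
```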

This all works fine. But now I want to plot the income distribution, and for that I create a new data set:

data$population2 <- data$richer + data$population # cumulative
x <- as.data.frame(matrix(data=NA,ncol=1,nrow=20))
names(x) <- c("population2")
x$population2 <- rev(seq(0.05,1,0.05))

library(plyr) # provides join()
data.graph <- join(x, data, by = "population2")

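So the variable data$population2 has the values 1, 0.8, 0.6, 0.4, 0.2, and x$population2 has the values 1, 0.95, 0.9, 0.85, 0.8 and so on, down to 0.05. However, join only matches the values 1, 0.8 and 0.2, and not 0.6 and 0.4! Can anyone help me?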

1 Answer


    Welcome to the first circle of R hell. :)

    At first glance, every value in data$population2 looks like it should have a match in x$population2:

    > x$population2
     [1] 1.00 0.95 0.90 0.85 0.80 0.75 0.70 0.65 0.60 0.55 0.50 0.45 0.40 0.35 0.30 0.25 0.20 0.15 0.10 0.05
    > data$population2
    [1] 1.0 0.8 0.6 0.4 0.2
    

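    But that is not the case. The values print the same, yet they are not exactly equal as floating-point numbers: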

    > x$population2[9]
    [1] 0.6
    > data$population2[3]
    [1] 0.6
    
    > data$population2[3] == x$population2[9]
    [1] FALSE
    > all.equal(data$population2[3], x$population2[9]) 
    [1] TRUE
    # all.equal tolerates numerical differences smaller than 1.5e-8 by default
    
    > print(x$population2[9], digits = 20)
    [1] 0.60000000000000009
    > print(data$population2[3], digits = 20)
    [1] 0.59999999999999987
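
The same effect can be reproduced without the question's data, and a tolerance-based lookup is one way around it. The following is a rough sketch, not part of plyr or base R: `match_tol` is a hypothetical helper, and the tolerance of 1e-8 is an assumption that is only reasonable when the keys are spaced much further apart than that.

```r
# Classic illustration: the two sides differ in the last bits
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE

# Hypothetical helper: for each element of `x`, the index of the first
# element of `table` that lies within `tol` of it (NA if there is none).
match_tol <- function(x, table, tol = 1e-8) {
  vapply(x, function(v) {
    hit <- which(abs(table - v) < tol)
    if (length(hit) > 0) hit[1] else NA_integer_
  }, integer(1))
}

grid <- rev(seq(0.05, 1, 0.05))            # same grid as x$population2
keys <- 1 - cumsum(rep(0.2, 5)) + 0.2      # approximately 1, 0.8, 0.6, 0.4, 0.2
idx  <- match_tol(keys, grid)
grid[idx]
```

Unlike `match()` or an exact join, this treats two numbers as the same key whenever they differ by less than `tol`.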
    

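    The following works for the example case, but I would be careful about applying it in every scenario without first considering whether the number of decimal places you round to is appropriate. In general, it is safer to perform joins on character keys: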

    library(plyr); library(dplyr)
    
    join(x %>% mutate(population2 = round(population2, 3)), 
         data%>% mutate(population2 = round(population2, 3)), 
         by = "population2")
    
       population2 incomes    income population richer      score
    1         1.00    1175 0.1260054        0.2    0.8 0.22680965
    2         0.95      NA        NA         NA     NA         NA
    3         0.90      NA        NA         NA     NA         NA
    4         0.85      NA        NA         NA     NA         NA
    5         0.80    1520 0.1630027        0.2    0.6 0.22820375
    6         0.75      NA        NA         NA     NA         NA
    7         0.70      NA        NA         NA     NA         NA
    8         0.65      NA        NA         NA     NA         NA
    9         0.60    1865 0.2000000        0.2    0.4 0.20000000
    10        0.55      NA        NA         NA     NA         NA
    11        0.50      NA        NA         NA     NA         NA
    12        0.45      NA        NA         NA     NA         NA
    13        0.40    2210 0.2369973        0.2    0.2 0.14219839
    14        0.35      NA        NA         NA     NA         NA
    15        0.30      NA        NA         NA     NA         NA
    16        0.25      NA        NA         NA     NA         NA
    17        0.20    2555 0.2739946        0.2    0.0 0.05479893
    18        0.15      NA        NA         NA     NA         NA
    19        0.10      NA        NA         NA     NA         NA
    20        0.05      NA        NA         NA     NA         NA
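
A join on a character key, as suggested above, could look roughly like this. It is a sketch under assumptions: the column name `key` is made up for illustration, two decimal places are assumed to be enough to identify a grid point, and base `merge` is used instead of `plyr::join` so the snippet runs without extra packages.

```r
# Rebuild simplified versions of the two tables from the question
population <- rep(1/5, 5)
richer <- 1 - cumsum(population)
data <- data.frame(incomes = c(1175, 1520, 1865, 2210, 2555),
                   population2 = richer + population)
x <- data.frame(population2 = rev(seq(0.05, 1, 0.05)))

# Assumption: formatting to two decimals uniquely identifies each grid point
x$key    <- sprintf("%.2f", x$population2)
data$key <- sprintf("%.2f", data$population2)

# Left join on the character key; keep only the non-float columns of `data`
joined <- merge(x, data[, c("key", "incomes")], by = "key", all.x = TRUE)
joined <- joined[order(joined$population2, decreasing = TRUE), ]

sum(!is.na(joined$incomes))   # number of grid rows that found a match
```

Because the keys are strings such as "0.60", the comparison no longer depends on the last bits of the underlying doubles.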
    

    As a side note, instead of the for loop in the earlier step, you can do the following:

    library(dplyr)
    
    # use this
    data <- data %>% mutate(richer = 1-cumsum(population))
    
    # instead of this
    for (i in 1:length(income)){
      data$richer[i] <- 1-sum(data$population[1:i])
    }
    

    For loops are relatively slow in R (which becomes noticeable on larger data sets). R is optimized for vectorized operations.
