
dplyr: group_by and summarise, using the summarised value when creating a new variable


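I am working with a data frame and using dplyr's group_by and summarise to get some results. However, one of the variables I want to create inside summarise needs to look up a value in a second data frame, based on the value of the grouping variable, and I can't figure out how to do that. Here is an example.

These are my 2 data frames: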

ExampleData <- structure(list(
  country = structure(c(5L, 5L, 5L, 1L, 1L, 1L, 4L, 4L, 4L, 2L, 2L, 2L),
                      .Label = c("Bolivia", "Colombia", "Ecuador", "Peru", "Venezuela"),
                      class = "factor"),
  area = c(21962759.1957539, 6116515271.82745, 4420526.44962988,
           950155731.837125, 3284949253.71748, 13008533744.7177,
           181171.153229255, 724458.059924146, 545485754.118267,
           646585511.365563, 5586512056.6131, 4025165194.1968)),
  .Names = c("country", "area"),
  row.names = c(0L, 1L, 2L, 87L, 88L, 89L, 117L, 118L, 
country.areas <- structure(list(
  country = c("Bolivia", "Colombia", "Ecuador", "Peru", "Venezuela"),
  area = c(1090353, 1141962, 256932, 1296912, 916560.5)),
  .Names = c("country", "area"),
  row.names = c(NA, 5L),
  class = "data.frame")
> head(ExampleData)
     country        area
0  Venezuela    21962759
1  Venezuela  6116515272
2  Venezuela     4420526
87   Bolivia   950155732
88   Bolivia  3284949254
89   Bolivia 13008533745
> head(country.areas)
    country      area
1   Bolivia 1090353.0
2  Colombia 1141962.0
3   Ecuador  256932.0
4      Peru 1296912.0
5 Venezuela  916560.5

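Now, using ExampleData, I want to group_by the country field and summarise to create the variable PercOfCountry, which is the summed area for each country divided by that country's total area, taken from country.areas. I am trying: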

by.country <- ExampleData %>% 
  group_by(country) %>% 
  summarise(km2.country = sum(area)/1000000,
            PercOfCountry = km2.country/country.areas$area[country.areas$country == country])

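The final country (the last word) is meant to refer to the country currently being handled by group_by, so that its area is taken from the country.areas data frame (for example, 1090353.0 for Bolivia). The km2.country part works as expected. I just want to divide that value by the country's area so that I get a percentage. Of course, I could easily do this in a separate step afterwards, but I am trying to learn dplyr and I still have a hard time understanding what group_by, which seems powerful, can actually do.

Thanks!

1 Answer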

  • 3

    This should do it...

    by.country <- ExampleData %>%
      group_by(country) %>%
      summarise(km2.country = sum(area) / 1000000) %>%
      # note: the join brings in a new variable from country.areas, also called area
      left_join(country.areas, by = "country") %>%
      mutate(PercOfCountry = km2.country / area)
    
    by.country
    # A tibble: 2 × 4
        country km2.country      area PercOfCountry
          <chr>       <dbl>     <dbl>         <dbl>
    1   Bolivia   17243.639 1090353.0    0.01581473
    2 Venezuela    6142.899  916560.5    0.00670212
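
If you would rather keep the lookup inside summarise, as in the question, one possible sketch is to take the single country of the current group and find its row in the lookup table with match(). This is only an illustration, not the only way to do it: the small ExampleData below is an assumption, rebuilt from the six rounded rows that head(ExampleData) prints in the question, so only Venezuela and Bolivia appear.

```r
library(dplyr)

# Stand-in data (assumption): the six rows shown by head(ExampleData), rounded as printed
ExampleData <- data.frame(
  country = factor(c("Venezuela", "Venezuela", "Venezuela",
                     "Bolivia", "Bolivia", "Bolivia")),
  area = c(21962759, 6116515272, 4420526,
           950155732, 3284949254, 13008533745)
)

# Lookup table with the total area of each country, as in the question
country.areas <- data.frame(
  country = c("Bolivia", "Colombia", "Ecuador", "Peru", "Venezuela"),
  area = c(1090353, 1141962, 256932, 1296912, 916560.5)
)

by.country <- ExampleData %>%
  group_by(country) %>%
  summarise(
    km2.country = sum(area) / 1e6,
    # Within a group every value of country is the same, so country[1] identifies
    # the group; match() returns the position of that name in the lookup table.
    PercOfCountry = km2.country /
      country.areas$area[match(as.character(country[1]), country.areas$country)]
  )

print(by.country)
```

The comparison in the question, country.areas$country == country, sets a length-5 vector against the whole group column, so it does not pick out a single row. Reducing the group to one value first avoids that. The join above is usually easier to read, though.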
    
