
Conditionally filling missing values when reshaping a long dataset to wide in R


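I am building a complete time series of an indicator for a set of years and countries, based on several datasets of differing quality.

Using reshape2, I have "melted" these datasets into a single data frame.

Example dataset: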

d <- structure(list(cntry = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("BE", 
"DE", "GE"), class = "factor"), year = c(1960L, 1970L, 1980L, 
1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 
1970L, 1960L, 1970L, 1960L, 1970L, 1970L, 1980L), indicator = c(5.5, 
1.2, 1.5, NA, 1.4, NA, NA, 5.5, 1.2, 2.3, 1.4, NA, 1.4, NA, NA, 
2.3, 1.4, 1.4, NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "male", class = "factor"), 
    source = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Council", 
    "Eurostat", "OECD"), class = "factor")), .Names = c("cntry", 
"year", "indicator", "sex", "source"), class = "data.frame", row.names = c(NA, 
-19L))


d
#    cntry year indicator  sex   source
# 1     BE 1960       5.5 male Eurostat
# 2     BE 1970       1.2 male Eurostat
# 3     BE 1980       1.5 male Eurostat
# 4     DE 1960        NA male Eurostat
# 5     DE 1970       1.4 male Eurostat
# 6     GE 1960        NA male Eurostat
# 7     GE 1970        NA male Eurostat
# 8     BE 1960       5.5 male     OECD
# 9     BE 1970       1.2 male     OECD
# 10    DE 1960       2.3 male     OECD
# 11    DE 1970       1.4 male     OECD
# 12    GE 1960        NA male     OECD
# 13    GE 1970       1.4 male     OECD
# 14    BE 1960        NA male  Council
# 15    BE 1970        NA male  Council
# 16    DE 1960       2.3 male  Council
# 17    DE 1970       1.4 male  Council
# 18    GE 1970       1.4 male  Council
# 19    GE 1980        NA male  Council

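I was hoping I could use the fun.aggregate argument of cast() to convert this long dataset to wide format while picking, for each country/year combination, the value from the highest-quality dataset (Eurostat > OECD > Council) to fill in the missing values. Unfortunately, I don't really understand how to write such a custom aggregation function.

In other words, I want to reshape the dataset from long to wide format while merging multiple values according to the value of a factor ("source"). Ideally, it would work like this: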

full_data <- expand.grid(c('BE', 'GE', 'DE'), c('1960', '1970', '1980'))
full_data <- fill_missings(full_data, d, pref_order=c('Eurostat', 'OECD', 'Council'))
full_data
# BE 1960 5.5 male Eurostat
# BE 1970 1.2 male Eurostat
# BE 1980 1.5 male Eurostat
# DE 1960 2.3 male OECD
# DE 1970 1.4 male Eurostat
# DE 1980 NA  NA   NA
# GE 1960 NA  male Council 
# GE 1970 1.4 male OECD
# GE 1980 NA  male Council

And optionally (or directly) in wide format:

# cntry  sex 1960 1970 1980
#    BE male  5.5  1.2  1.5
#    DE male  2.3  1.4  NA
#    GE male   NA  1.4  NA

4 Answers

  • 1

    Assuming the data is already in the order you need, i.e. the column source is sorted with Eurostat first, then OECD, then Council, I'd go about it this way using data.table:

    require(data.table) # >= v1.9.0
    setDT(d) # converts data.frame to data.table by reference
    dcast.data.table(d, cntry + sex ~ year, value.var="indicator", 
     subset=.(!duplicated(d, by=c("cntry", "year", "indicator")) & !is.na(indicator)))
    
    #    cntry  sex 1960 1970 1980
    # 1:    BE male  5.5  1.2  1.5
    # 2:    DE male  2.3  1.4   NA
    # 3:    GE male   NA  1.4   NA
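
    The ordering assumption above can be enforced rather than taken for granted. The following is a minimal base R sketch, not part of the original answer: `pref_order` and the toy data frame `dd` are hypothetical names standing in for the real `d`.

    ```r
    # Assumed preference order, taken from the question (best source first)
    pref_order <- c("Eurostat", "OECD", "Council")

    # Small stand-in for `d`, with rows deliberately out of order
    dd <- data.frame(
      cntry     = c("DE", "DE", "DE"),
      year      = c(1960L, 1960L, 1960L),
      indicator = c(2.3, NA, 2.3),
      source    = c("Council", "Eurostat", "OECD")
    )

    # Re-level the factor so its integer codes follow the preference order,
    # then sort the rows by those codes
    dd$source <- factor(as.character(dd$source), levels = pref_order)
    dd <- dd[order(dd$source), ]

    as.character(dd$source)
    # [1] "Eurostat" "OECD"     "Council"
    ```

    Applying the same two lines to `d` before the reshape should make the result independent of how the source datasets were stacked.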
    
  • 1

    I'm not sure whether this meets all your expectations, but it sounds like you're looking for something like the following:

    toMerge <- expand.grid(cntry = c("BE", "DE", "GE"), 
                           year = c(1960, 1970, 1980), 
                           source = c("Eurostat", "OECD", "Council"), 
                           sex = "male")
    d2 <- merge(d, toMerge, all = TRUE)
    
    d2$source <- factor(d2$source, c("Council", "OECD", "Eurostat"), ordered=TRUE)
    d2 <- d2[order(d2$source, decreasing=TRUE), ]
    Rank <- with(d2, ave(indicator, d2[c("cntry", "year", "sex")], 
                     FUN = function(x) rank(x, ties.method="first", na.last=TRUE)))
    D <- d2[Rank == 1, ]
    D
    #    cntry year  sex   source indicator
    # 2     BE 1960 male Eurostat       5.5
    # 5     BE 1970 male Eurostat       1.2
    # 8     BE 1980 male Eurostat       1.5
    # 14    DE 1970 male Eurostat       1.4
    # 17    DE 1980 male Eurostat        NA
    # 20    GE 1960 male Eurostat        NA
    # 26    GE 1980 male Eurostat        NA
    # 12    DE 1960 male     OECD       2.3
    # 24    GE 1970 male     OECD       1.4
    
    library(reshape2)
    dcast(D, cntry ~ year, value.var="indicator")
    #   cntry 1960 1970 1980
    # 1    BE  5.5  1.2  1.5
    # 2    DE  2.3  1.4   NA
    # 3    GE   NA  1.4   NA
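
    Note that rank() orders rows by the value of indicator, so the approach above picks the preferred source only through tie-breaking when the sources agree, as they do in the sample data. If the sources could disagree, one possible variant is to select by position in the preference order instead. This is only a sketch on made-up data; `pref_order`, `dd`, `pos` and `picked` are hypothetical names.

    ```r
    pref_order <- c("Eurostat", "OECD", "Council")

    # Toy data in which the sources disagree for BE 1960
    dd <- data.frame(
      cntry     = c("BE", "BE", "BE", "DE", "DE"),
      year      = c(1960L, 1960L, 1960L, 1960L, 1960L),
      indicator = c(9.9, 5.5, 1.0, NA, 2.3),
      source    = c("Eurostat", "OECD", "Council", "Eurostat", "OECD"),
      stringsAsFactors = FALSE
    )

    # Position of each source in the preference order (1 = best)
    pos <- match(dd$source, pref_order)

    # Sort so that, within a country/year, non-missing rows come first
    # and the preferred source comes first among those
    dd <- dd[order(dd$cntry, dd$year, is.na(dd$indicator), pos), ]

    # Keep the first row of each country/year group
    picked <- dd[!duplicated(dd[c("cntry", "year")]), ]
    picked[c("cntry", "year", "indicator", "source")]
    ```

    Here BE 1960 keeps the Eurostat value (9.9) and DE 1960 falls back to OECD (2.3), and the source column is preserved for the later dcast().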
    
  • 0

    Perhaps the following would also work:

    library(reshape2)
    x <- melt(d,id.vars=c("cntry","year","source","sex"))
    y <- dcast(x,cntry+year+sex ~ source)
    y$selected.value <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes=y$Council,no=y$OECD),no=y$Eurostat)
    dcast(y,cntry + sex ~ year, value.var="selected.value")
    

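    The source selection is done with nested ifelse statements. With this approach the indication of which source was selected is lost; if that is a problem, a similar ifelse statement can be added to create a source-origin variable: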

    y$selected.source <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes="Council",no="OECD"),no="Eurostat")
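
    With more than three sources the nested ifelse calls become hard to read. One way to generalise them, sketched here on a toy table shaped like `y`, is to fold over the source columns in preference order with Reduce(). The names `pref_order`, `yy` and `first_non_na` are hypothetical and not part of the original answer.

    ```r
    pref_order <- c("Eurostat", "OECD", "Council")

    # Toy wide table shaped like `y` above: one column per source
    yy <- data.frame(
      cntry    = c("BE", "DE", "GE"),
      Council  = c(NA, 2.3, NA),
      Eurostat = c(5.5, NA, NA),
      OECD     = c(5.5, 2.3, NA)
    )

    # Hypothetical helper: take `a` where present, otherwise `b`
    first_non_na <- function(a, b) ifelse(is.na(a), b, a)

    # Fold over the source columns in preference order
    yy$selected.value <- Reduce(first_non_na, yy[pref_order])

    # Index of the first non-missing source per row (NA if all are missing)
    idx <- apply(!is.na(as.matrix(yy[pref_order])), 1, function(r) which(r)[1])
    yy$selected.source <- pref_order[idx]

    yy[c("cntry", "selected.value", "selected.source")]
    ```

    For the toy rows this gives 5.5 from Eurostat for BE, 2.3 from OECD for DE, and NA for GE.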
    
  • 2

    Here's another option:

    library(reshape2)
    d$source <- factor(d$source, levels=c('Eurostat', 'OECD', 'Council'))
    d2 <- d[1:4]
    d2[[3]] <- lapply(split(d, 1:nrow(d)), `[`, c(3, 5))
    dcast(
      d2, cntry + sex ~ year, value.var="indicator", 
      fun.aggregate=function(x) {
        if(!length(x)) return(NA_real_)
        xs <- do.call(rbind, x)
        xs <- xs[complete.cases(xs), ]
        if(nrow(xs)) xs[order(as.numeric(xs$source)), "indicator"][[1L]] else NA_real_
    } )
    

    Produces:

    cntry  sex  1960  1970  1980
    1    BE male 105.5 101.2 101.5
    2    DE male   2.3 101.4    NA
    3    GE male    NA   1.4    NA
    

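    Note that I added 100 to the "Eurostat" values to make them distinguishable from the others, since in this sample set they appear to be equal.

    Basically, we cheat by turning the indicator column into a list column whose items hold both the indicator and the source, and then use fun.aggregate to pick, within each group, the item with the lowest source value (note that we reset the factor levels so that the most preferred source has the lowest level).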
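
    The remark about resetting the factor can be checked in isolation. This small sketch (the variable `src` is made up for illustration) shows how the integer codes returned by as.numeric() change once the levels are given in preference order, which is what order(as.numeric(xs$source)) relies on:

    ```r
    # Default levels are alphabetical: Council = 1, Eurostat = 2, OECD = 3
    src <- factor(c("Council", "Eurostat", "OECD"))
    as.numeric(src)
    # [1] 1 2 3

    # Re-level so the most preferred source gets the lowest code
    src <- factor(src, levels = c("Eurostat", "OECD", "Council"))
    as.numeric(src)
    # [1] 3 1 2

    # Sorting by the codes now puts Eurostat first
    as.character(src[order(as.numeric(src))])
    # [1] "Eurostat" "OECD"     "Council"
    ```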
