首页 文章

按组重复观察的百分比

提问于
浏览
0

我有一个随时间变化的公司员工数据集,看起来像这样

data.table(firm = c(rep("A", 8), rep("B", 8)), 
           employee = c(1, 2, 3, 4, 1, 2, 3, NA, 5, 6, NA, NA, 5, 6, 7, 8),
           year = c(rep(1, 4), rep(2, 4)))

  firm employee_id year
    A        1    1
    A        2    1
    A        3    1
    A        4    1
    A        1    2
    A        2    2
    A        3    2
    A       NA    2
    B        5    1
    B        6    1
    B       NA    1
    B       NA    1
    B        5    2
    B        6    2
    B        7    2
    B        8    2

我想计算每个公司的年度== 1的员工百分比= = 2 .

输出应该是这样的

firm year continued_employees
 A    2     0.75
 B    2     1

我可以在每年的循环中使用它

sum(employee_id[year==1] %in% employee_id[year==2]) / length(employee_id[year==1])

但我有大约4万家公司和10年的观察 . 有关如何在 dplyrdata.table 语法中执行此操作的任何想法?

3 回答

  • 2

    这是一个不那么漂亮的 data.table 方法,你可以用于任何数量的公司和年份:

    years <- head(sort(unique(dt$year)), -1)
    setNames(lapply(years, function(y) {
      dt[dt[(year == y), .(firm, employee)], on = .(firm, employee)][
        !is.na(employee), all(c(y, y+1) %in% year), by = .(employee, firm)][, 
          .(continued = mean(V1), year = y+1), by = firm]
    }), paste("Year", years, sep="-"))
    
    #$`Year-1`
    #   firm continued year
    #1:    A      0.75    2
    #2:    B      1.00    2
    

    由于您的样本数据只有两年,因此您只能获得一个列表元素 .

  • 1

    加入转移年份

    这是一种使用一种具有转移年份的自我联接的方法:

    library(data.table)
    options(datatable.print.class = TRUE)
    # self join with shifted year
    DT[.(firm = firm, employee = employee, year = year - 1), 
       on = .(firm, employee, year), cont := TRUE][]
    # aggregate
    DT[!is.na(employee), sum(cont, na.rm = TRUE) / .N, by = .(firm, year = year + 1)][
      # beautify result
      year <= max(DT$year)]
    

    公司年份V1
    <char> <num> <num>
    1:A 2 0.75
    2:B 2 1.00

    第一个表达式修改 DT 以表示继续雇员:

    坚定的员工年度
    <char> <num> <num> <lgcl>
    1:A 1 1 TRUE
    2:A 2 1是
    3:A 3 1是
    4:A 4 1 NA
    5:A 1 2 NA
    6:A 2 2 NA
    7:A 3 2 NA
    8:NA 2 NA
    9:B 5 1是的
    10:B 6 1是的
    11:B NA 1 NA
    12:B NA 1 NA
    13:B 5 2 NA
    14:B 6 2 NA
    15:B 7 2 NA
    16:B 8 2 NA

    使用shift()

    或者, shift() 函数可用于计算 cont 列 . 聚合部分与上面的连接方法相同 . shift() 要求确保按年份排序数据 .

    DT[order(year), cont := shift(year, type = "lead") == year + 1, by = .(firm, employee)][
      !is.na(employee), sum(cont, na.rm = TRUE) / .N, by = .(firm, year = year + 1)][
        year <= max(DT$year)]
    

    基准

    在撰写本文时,除了OP自己尝试使用循环之外,还提出了三种方法:

    Jean Vuda的答案在基准测试中未被考虑,因为它仅限于2年 .

    根据OP, 生产环境 数据集包括40 k公司和10年的数据 . 对于实际的基准测试,会创建一个类似大小的样本数据集:

    n_firm <- 40000L
    max_employee <- 10L
    fluctuation_rate <- 0.2
    n_year <- 10L
    start_year <- 2001L
    
    DT0 <- CJ(firm = sprintf("%06i", seq_len(n_firm)), 
              employee = seq_len(max_employee), 
              year = seq(start_year, length.out = n_year))
    set.seed(123L)
    n_row <- nrow(DT0)
    DT0[sample.int(n_row, fluctuation_rate * n_row), employee := NA]
    

    样本数据集由4 M行组成,在从长格式转换为宽格式后可以最佳地显示:

    dcast(DT0[!is.na(employee)], firm + employee ~ year)
    

    使用'year'作为值列 . 使用'value.var'覆盖
    公司员工2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
    <char> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
    1:000001 1 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
    2:000001 2 2001 2002 2003 NA 2005 2006 2007 NA 2009 NA
    3:000001 3 2001 2002 NA NA 2005 2006 2007 2008 2009 2010
    4:000001 4 2001 NA NA NA 2005 2006 2007 2008 NA 2010


    399996:040000 6 2001 2002 NA 2004 2005 NA NA NA 2009 2010
    399997:040000 7 NA 2002 NA NA 2005 2006 2007 2008 2009 2010
    399998:040000 8 2001 2002 2003 NA NA NA 2007 NA NA 2010
    399999:040000 9 2001 2002 2003 NA 2005 2006 2007 2008 NA
    400000:040000 10 2001 2002 2003 NA NA 2006 2007 2008 2009 2010

    对于基准测试,使用 microbenchmark 包是因为可以传递检查函数以验证结果是否相同:

    my_check <- function(values) {
      values <- lapply(values, function(x) x[, dcast(.SD, firm ~ year, value.var = "continued")])
      all(sapply(values[-1], function(x) identical(values[[1]], x)))
    }
    

    基准代码:

    microbenchmark::microbenchmark(
      dd = {
        dt <- copy(DT0)
        years <- head(sort(unique(dt$year)), -1)
        rbindlist(
          setNames(lapply(years, function(y) {
            dt[dt[(year == y), .(firm, employee)], on = .(firm, employee)][
              !is.na(employee), all(c(y, y+1) %in% year), by = .(employee, firm)][
                , .(continued = mean(V1), year = y+1), by = firm]
          }), paste("Year", years, sep="-"))
        )
      },
      join = {
        DT <- copy(DT0)
        DT[.(firm = firm, employee = employee, year = year - 1), 
           on = .(firm, employee, year), cont := TRUE][
             !is.na(employee), .(continued = sum(cont, na.rm = TRUE) / .N), 
             by = .(firm, year = year + 1)][
               year <= max(DT$year)]
      },
      shift = {
        DT <- copy(DT0)
        DT[order(year), cont := shift(year, type = "lead") == year + 1, 
           by = .(firm, employee)][
             !is.na(employee), .(continued = sum(cont, na.rm = TRUE) / .N), 
             by = .(firm, year = year + 1)][
               year <= max(DT$year)]
      },
      check = my_check,
      times = 3L
    )
    

    基准测试结果表明,连接方法比移位方法快4倍,比docendo discimus方法快8倍 .

    单位:秒
    expr min lq mean中位数uq max neval cld
    dd 11.756114 11.919959 12.083042 12.083805 12.246506 12.409207 3 c
    加入1.054293 1.239829 1.303971 1.425366 1.428810 1.432254 3 a
    shift 6.105725 6.105906 6.148136 6.106087 6.169342 6.232596 3 b

  • 0

    这是一个稍微不同的方法:

    dt<-dat[,list(all=.(unique(employee))), by=list(year,firm)]
    dt<-dt[,list(year1=sapply(list(all),`[`,1), 
                 year2=sapply(list(all),`[`,2)), by=firm]
    dt[,uniqueN(mapply(intersect, year1, year2))/uniqueN(na.omit(unlist(year1))),by=firm]
    

相关问题