
Count of companies per year using dplyr or data.table


Let's say I have a data frame:

df <- data.frame(City = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
                 YearFrom = c("2001", "2003", "2002", "2006", "2008", "2004", "2005", "2005", "2002"),
                 YearTo = c(NA, "2005", NA, NA, "2009", NA, "2008", NA, NA))

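Here YearFrom is, for example, the year a company was founded and YearTo is the year it was closed. If YearTo is NA, the company is still operating.

I want to count the number of companies for each year.

The table should look like this: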

City | Year | Count
NY   | 2001 | 1
NY   | 2002 | 2
NY   | 2003 | 3
NY   | 2004 | 3
NY   | 2005 | 2
NY   | 2006 | 3
NY   | 2007 | 3
NY   | 2008 | 4
NY   | 2009 | 3
LA   | 2001 | 0
LA   | 2002 | 1
LA   | 2003 | 1
LA   | 2004 | 2
LA   | 2005 | 4
LA   | 2006 | 4
LA   | 2007 | 4
LA   | 2008 | 2
LA   | 2009 | 2

I would like to solve this with the dplyr or data.table package, but I can't figure out how.

4 Answers

  • 8

    A shorter tidyverse solution.

    library(dplyr)
    library(purrr)
    library(tidyr)

    # First, some data prep
    df <- mutate(df,
        YearFrom = as.numeric(as.character(YearFrom)),                     # Fix year coding
        YearTo = as.numeric(as.character(YearTo)),
        YearTo = coalesce(YearTo, max(c(YearFrom, YearTo), na.rm = TRUE))) # Replace NA with max
    
    df %>% 
      mutate(Years = map2(YearFrom, YearTo - 1, `:`)) %>%          # Find all years
      unnest(Years) %>%                                            # Spread over rows
      count(Years, City) %>%                                       # Count them
      complete(City, Years, fill = list(n = 0))                    # Add in zeros, if needed
    
  • 7

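    First, clean up the data:

    library(data.table)

    # year() comes from data.table, so the package must be loaded first
    curr_year = as.integer(year(Sys.Date()))
    
    setDT(df)
    df[, YearTo := as.integer(as.character(YearTo)) ]
    df[, YearFrom := as.integer(as.character(YearFrom)) ]
    # quasiYearTo treats companies that are still open as running until the current year
    df[, quasiYearTo := YearTo ]
    df[is.na(YearTo), quasiYearTo := curr_year ]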
    

    Then, a non-equi join:

    df[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE), 
      on=.(City, YearFrom <= Year, quasiYearTo > Year), allow.cartesian = TRUE, 
      .N
    , by=.EACHI][, .(City, Year = YearFrom, N)]
    
        City Year N
     1:   LA 2001 0
     2:   LA 2002 1
     3:   LA 2003 1
     4:   LA 2004 2
     5:   LA 2005 4
     6:   LA 2006 4
     7:   LA 2007 4
     8:   LA 2008 3
     9:   LA 2009 3
    10:   NY 2001 1
    11:   NY 2002 2
    12:   NY 2003 3
    13:   NY 2004 3
    14:   NY 2005 2
    15:   NY 2006 3
    16:   NY 2007 3
    17:   NY 2008 4
    18:   NY 2009 3
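
    The join condition above counts, for each city and year, the rows where YearFrom <= Year < quasiYearTo. As a cross-check, here is a minimal base R sketch of that same rule. It is only an illustration under stated assumptions: `df_check`, `count_active` and `grid` are hypothetical names, the years are entered as integers directly, and an NA in YearTo is treated as "still open", which matches the use of quasiYearTo as long as the years examined lie before the current year.

    ```r
    # Sample data from the question, with years stored as integers (assumption)
    df_check <- data.frame(
      City     = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
      YearFrom = c(2001L, 2003L, 2002L, 2006L, 2008L, 2004L, 2005L, 2005L, 2002L),
      YearTo   = c(NA, 2005L, NA, NA, 2009L, NA, 2008L, NA, NA),
      stringsAsFactors = FALSE
    )

    # Hypothetical helper: number of rows of `d` active in `year` for `city`.
    # A row is active when it has started (YearFrom <= year) and has not closed
    # yet (YearTo is NA, or year < YearTo, so the closing year is excluded).
    count_active <- function(d, city, year) {
      sum(d$City == city & d$YearFrom <= year & (is.na(d$YearTo) | year < d$YearTo))
    }

    # Every city/year combination in the range shown in the question
    grid <- expand.grid(Year = 2001:2009, City = c("LA", "NY"),
                        stringsAsFactors = FALSE)
    grid$N <- mapply(count_active, city = grid$City, year = grid$Year,
                     MoreArgs = list(d = df_check), USE.NAMES = FALSE)

    grid[, c("City", "Year", "N")]
    ```

    Under these assumptions the sketch gives the same counts as the join output above, including 3 for LA in 2008 and 2009.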
    
  • 5

    Here is an answer using data.table. The data preparation is at the bottom.

    # get list of businesses, one obs per year of operation
    cityList <- lapply(seq_len(nrow(df)),
                  function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))])
    
    # combine to a single data.table
    dfNew <- rbindlist(cityList)
    
    # get counts
    dfNew <- dfNew[, .(Count=.N), by=.(City, Year)]
    

    Written as a single statement, this is

    # get the counts
    rbindlist(lapply(seq_len(nrow(df)),
              function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))]))[, .(Count=.N),
      by=.(City, Year)]
    

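    Here, lapply loops over each row and constructs a data.table with the city value repeated in one column and the years of operation in a second column. YearTo is decremented by one so that the closing year is not included. Note that in the data preparation, missing values are set to 2018 so that 2017 is included in the count.

    lapply returns a list of data.tables, which rbindlist combines into a single data.table. That data.table is then aggregated to city-year pairs, with the count built using .N.

    These return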

        City Year Count
     1:   NY 2001     1
     2:   NY 2002     2
     3:   NY 2003     3
     4:   NY 2004     3
     5:   NY 2005     2
     6:   NY 2006     3
     7:   NY 2007     3
      ...
    26:   LA 2012     3
    27:   LA 2013     3
    28:   LA 2014     3
    29:   LA 2015     3
    30:   LA 2016     3
    31:   LA 2017     3
    32:   LA 2002     1
    33:   LA 2003     1
    

    Data preparation

    setDT(df)
    # convert string years to integers
    df[, grep("Year", names(df), value=TRUE) := 
       lapply(.SD, function(x) as.integer(as.character(x))), .SDcols=grep("Year", names(df))]
    # replace NA values with 2018 (to include 2017 in count)
    df[is.na(YearTo), YearTo := 2018]
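
    For readers without data.table installed, the same expand-then-count idea can be sketched in base R. This is an illustrative variant, not the method used above: `biz`, `per_business`, `long` and `counts` are hypothetical names, and the data are typed in as they look after the preparation step (integer years, NA in YearTo replaced by 2018).

    ```r
    # Sample data after the preparation step (assumption: NA YearTo already set to 2018)
    biz <- data.frame(
      City     = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
      YearFrom = c(2001L, 2003L, 2002L, 2006L, 2008L, 2004L, 2005L, 2005L, 2002L),
      YearTo   = c(2018L, 2005L, 2018L, 2018L, 2009L, 2018L, 2008L, 2018L, 2018L),
      stringsAsFactors = FALSE
    )

    # One small data frame per business: one row per year of operation,
    # stopping one year before YearTo so the closing year is not counted
    per_business <- lapply(seq_len(nrow(biz)), function(i) {
      data.frame(City = biz$City[i],
                 Year = seq(biz$YearFrom[i], biz$YearTo[i] - 1L),
                 stringsAsFactors = FALSE)
    })

    # Stack the pieces into one long data frame (the role rbindlist plays above)
    long <- do.call(rbind, per_business)

    # Count rows per City/Year pair (the role .N plays above)
    long$Count <- 1L
    counts <- aggregate(Count ~ City + Year, data = long, FUN = sum)

    counts[counts$City == "LA" & counts$Year %in% c(2002, 2017), ]
    ```

    Like the data.table version, this sketch only produces rows for years in which at least one business was active, so LA in 2001 is absent rather than reported as 0.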
    
  • 2

    This solution uses dplyr and tidyr.

    library(dplyr)
    library(tidyr)
    
    df %>%
      # Change YearFrom and YearTo to numeric
      mutate(YearFrom = as.numeric(as.character(YearFrom)), 
             YearTo = as.numeric(as.character(YearTo))) %>%
      # Replace NA with 2017 in YearTo
      mutate(YearTo = ifelse(is.na(YearTo), 2017, YearTo)) %>%
      # Subtract 1 from YearTo to exclude the year of closure
      mutate(YearTo = YearTo - 1) %>%
      # Group by row
      rowwise() %>%
      # Create a tibble for each row, expanding the Year column from YearFrom to YearTo
      # (tibble() replaces the deprecated data_frame())
      do(tibble(City = .$City, Year = seq(.$YearFrom, .$YearTo, by = 1))) %>%
      ungroup() %>%
      # Count the rows for each City and Year
      count(City, Year) %>%
      # Rename the column n to Count
      rename(Count = n) %>%
      # Spread the data frame to expose the implicit missing value for LA, 2001
      spread(Year, Count) %>%
      # Gather it back into long form; convert = TRUE turns Year back into a number
      gather(Year, Count, -City, convert = TRUE) %>%
      # Replace NA with 0 in Count
      mutate(Count = ifelse(is.na(Count), 0L, Count)) %>%
      # Arrange the data 
      arrange(desc(City), Year) %>%
      # Keep only the years up to 2009
      filter(Year <= 2009)
    
