
Calculating active days/months from overlapping dates


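I have data listing the start and end dates of different products for a large number of customers. The intervals for different products may overlap, or there may be gaps between purchases: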

library(lubridate)
library(Hmisc)
library(dplyr)

user_id <- c(rep(12, 8), rep(33, 5))

start_date <- dmy(Cs(31/10/2010,    18/12/2010, 31/10/2011, 18/12/2011, 27/03/2014, 18/12/2014, 27/03/2015, 18/12/2016, 01/07/1992, 20/08/1993, 28/10/1999, 31/01/2006, 26/08/2016))

end_date <- dmy(Cs(31/10/2011,  18/12/2011, 28/04/2014, 18/12/2014, 27/03/2015, 18/12/2016, 27/03/2016, 18/12/2017,
               01/07/2016,  16/08/2016, 15/11/2012, 28/02/2006, 26/01/2017))

data <- data.frame(user_id, start_date, end_date)

data
   user_id start_date   end_date
1       12 2010-10-31 2011-10-31
2       12 2010-12-18 2011-12-18
3       12 2011-10-31 2014-04-28
4       12 2011-12-18 2014-12-18
5       12 2014-03-27 2015-03-27
6       12 2014-12-18 2016-12-18
7       12 2015-03-27 2016-03-27
8       12 2016-12-18 2017-12-18
9       33 1992-07-01 2016-07-01
10      33 1993-08-20 2016-08-16
11      33 1999-10-28 2012-11-15
12      33 2006-01-31 2006-02-28
13      33 2016-08-26 2017-01-26

I'd like to calculate, for each user, the total number of active days or months during which he/she held any of the products.

If the products always overlapped there would be no problem, because I could simply take

data %>% 
group_by(user_id) %>% 
dplyr::summarize(time_diff = max(end_date) - min(start_date))

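However, as you can see for user 33, the products do not always overlap, so any interval that stands apart has to be added separately to the "overlapping" intervals.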
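To make the gap concrete, here is a minimal base R sketch using user 33's rows from the data above. The variable names `naive_span` and `gap_days` are illustrative only and do not appear elsewhere in this post.

```r
# User 33's intervals, copied from the data above
start_date <- as.Date(c("1992-07-01", "1993-08-20", "1999-10-28",
                        "2006-01-31", "2016-08-26"))
end_date   <- as.Date(c("2016-07-01", "2016-08-16", "2012-11-15",
                        "2006-02-28", "2017-01-26"))

# Naive span: treats every day between the first start and the last end as active
naive_span <- as.numeric(max(end_date) - min(start_date))

# The first four products all end by 2016-08-16; the fifth starts on 2016-08-26.
# The days strictly between those two dates are covered by no product.
gap_days <- as.numeric(start_date[5] - max(end_date[1:4])) - 1

naive_span
gap_days
```

The naive span silently counts the `gap_days` as active, which is exactly what I want to avoid.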

Is there a quick and elegant way to code this, preferably in dplyr?

3 Answers

  • 3

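    We can use functions from dplyr to count the total number of days. The following example expands each time period into individual dates and then removes the duplicated dates. Finally, it counts the rows for each user_id.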

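    data2 <- data %>%
      rowwise() %>%
      # expand each interval into one row per calendar day
      do(tibble(user_id = .$user_id,
                Date = seq(.$start_date, .$end_date, by = 1))) %>%
      # drop days that are covered by more than one product
      distinct() %>%
      ungroup() %>%
      # the remaining rows per user are the active days
      count(user_id)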
    
  • 2

    How about using IRanges? The code below uses its reduce function to merge overlapping ranges before summing their widths.

    library(IRanges)
    data %>% 
      group_by(user_id) %>% 
      summarise(days_held=sum(width(reduce(IRanges(as.numeric(start_date), as.numeric(end_date)))))) 
    # A tibble: 2 × 2
      user_id days_held
        <dbl>     <int>
    1      12      2606
    2      33      8967
    

    Using Nathan Wert's benchmark data big_data here, the IRanges approach looks slightly faster.

    my_result <- function(x) {
    x %>% 
        group_by(user_id) %>% 
        summarise(days_held=sum(width(reduce(IRanges(as.numeric(start_date), as.numeric(end_date)))))) 
    }
    
    
    library(microbenchmark)
    microbenchmark(
      a <- my_result(big_data),
      b <- my_answer(big_data), times=2
    )
    Unit: seconds
                         expr      min       lq     mean   median       uq      max neval cld
     a <- my_result(big_data) 14.97008 14.97008 14.98896 14.98896 15.00783 15.00783     2  a 
     b <- my_answer(big_data) 17.59373 17.59373 17.76257 17.76257 17.93140 17.93140     2   b
    
    all.equal(a, b)
    [1] TRUE
    

    Edit

    To visualize the ranges, you can also plot the data:

    library(Gviz)
    library(GenomicRanges)
    a <- sapply(split(data, data$user_id), function(x) {
      AnnotationTrack(start = as.numeric(x$start_date), end = as.numeric(x$end_date),
                      chromosome = "chrNA", stacking = "full", name = as.character(unique(x$user_id)))
    })
    plotTracks(trackList = a)
    


  • 2

    Building data.frames is not efficient, so you can save time by keeping the ranges as Date vectors.

    multi_seq_date <- Vectorize(seq.Date, c('from', 'to'), SIMPLIFY = FALSE)
    
    data %>%
      group_by(user_id) %>%
      mutate(date_seq = multi_seq_date(start_date, end_date, by = 'day')) %>%
      summarise(days_held = length(unique(unlist(date_seq))))
    

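    There is probably a more idiomatic way to write this, but I'm not a tidyverse person.

    multi_seq_date returns a list of date sequences. Here is the answer run on a large, randomly generated sample set: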

    # Making the data -----------------------------------
    big_size <- 100000
    starting_range <- seq(dmy('01-01-1990'), dmy('01-01-2017'), by = 'day')
    
    set.seed(123456)
    big_data <- data.frame(
      user_id    = sample(seq_len(round(big_size / 4)), big_size, replace = TRUE),
      start_date = sample(starting_range, big_size, replace = TRUE)
    )
    big_data$end_date <- big_data$start_date + round(runif(big_size, 1, 500))
    
    
    # The actual process to test -------------------------
    my_answer <- function(x) {
      multi_seq_date <- Vectorize(seq.Date, c('from', 'to'), SIMPLIFY = FALSE)
      x %>%
        group_by(user_id) %>%
        mutate(date_seq = multi_seq_date(start_date, end_date, by = 'day')) %>%
        summarise(days_held = length(unique(unlist(date_seq))))
    }
    

    On my computer, my_answer took about 13 seconds.
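The answers above either expand every interval into individual days or rely on IRanges. As a rough alternative, the same interval merge can be sketched in dplyr alone: sort by start date, track the running maximum end date, and start a new block whenever a start date falls after everything seen so far. This is only a sketch under assumptions: the column names `prev_max_end`, `block`, `block_start` and `block_end` are made up for illustration, it needs dplyr 1.0 or later for `.groups`, and both endpoints are counted as active days so that the totals are comparable with the IRanges output.

```r
library(dplyr)

# The question's data, written with ISO dates so only dplyr is required
data <- data.frame(
  user_id = c(rep(12, 8), rep(33, 5)),
  start_date = as.Date(c(
    "2010-10-31", "2010-12-18", "2011-10-31", "2011-12-18", "2014-03-27",
    "2014-12-18", "2015-03-27", "2016-12-18", "1992-07-01", "1993-08-20",
    "1999-10-28", "2006-01-31", "2016-08-26")),
  end_date = as.Date(c(
    "2011-10-31", "2011-12-18", "2014-04-28", "2014-12-18", "2015-03-27",
    "2016-12-18", "2016-03-27", "2017-12-18", "2016-07-01", "2016-08-16",
    "2012-11-15", "2006-02-28", "2017-01-26"))
)

active_days <- data %>%
  arrange(user_id, start_date) %>%
  group_by(user_id) %>%
  mutate(
    # latest end date among all earlier rows of this user (-Inf for the first row)
    prev_max_end = lag(cummax(as.numeric(end_date)), default = -Inf),
    # a new block starts when this interval begins after everything seen so far
    block = cumsum(as.numeric(start_date) > prev_max_end)
  ) %>%
  group_by(user_id, block) %>%
  summarise(block_start = min(start_date),
            block_end   = max(end_date),
            .groups = "drop") %>%
  group_by(user_id) %>%
  # + 1 counts both the first and the last day of each merged block
  summarise(days_held = sum(as.numeric(block_end - block_start) + 1))

active_days
```

On the sample data this gives 2606 days for user 12 and 8967 days for user 33, the same totals as the IRanges answer. It never materializes one row per day, so it may scale better on long intervals, though it has not been benchmarked against big_data here.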
