
Validating a time series index

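I am working with a dataset that is grouped with dplyr's group_by function. Each group has its own time index, which is supposed to consist of a sequence of 12 consecutive months. The sequence can run from January to December, or in other cases from June of one year to May of the next.

Here is a sample of the dataset: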

ID       DATE
      8 2017-01-31
      8 2017-02-28
      8 2017-03-31
      8 2017-04-30
      8 2017-05-31
      8 2017-06-30
      8 2017-07-31
      8 2017-08-31
      8 2017-09-30
      8 2017-10-31
      8 2017-11-30
      8 2017-12-31
     32 2017-01-31
     32 2017-02-28
     32 2017-03-31
     32 2017-04-30
     32 2017-05-31
     32 2017-06-30
     32 2017-07-31
     32 2017-08-31
     32 2017-09-30
     32 2017-10-31
     32 2017-11-30
     32 2017-12-31
     45 2016-09-30
     45 2016-10-31
     45 2016-11-30
     45 2016-12-31
     45 2017-01-31
     45 2017-02-28
     45 2017-03-31
     45 2017-04-30
     45 2017-05-31
     45 2017-06-30
     45 2017-07-31
     45 2017-08-31

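The problem is that, because of the size of the dataset, I cannot visually confirm whether the dates are consistent, in other words whether there are any so-called "jumps". Is there a simple way to check this, perhaps with some combination of functions from the tibbletime package?

Any help would be appreciated.

Thanks in advance.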

2 Answers

  • 0

    Here is how I usually solve this problem with data.table. The base R functions cut.Date() and seq.Date() are the core of the logic, so you can use the same approach with dplyr if you prefer.

    library(data.table)
    
    ## Convert to data.table
    setDT(df)
    
    ## Convert DATE to a date in case it wasn't already
    df[,DATE := as.Date(DATE)]
    
    ## Order by ID and Date
    setkey(df,ID,DATE)
    
    ## Create a column with the month of each date
    df[,Month := as.Date(cut.Date(DATE, breaks = "months"))]
    
    ## Generate a sequence of Dates by month for the number of observations
    ## in each group -- .N
    df[,ExpectedMonth := seq.Date(from = min(Month),
                                  by = "months",
                                  length.out = .N), by = .(ID)]
    
    ## Create a summary table to test whether an ID had 12 observations where
    ## the actual month was equal to the expected month
    Test <- df[Month == ExpectedMonth, .(Valid = .N == 12L), by = .(ID)]
    
    print(Test)
    #    ID Valid
    # 1:  8  TRUE
    # 2: 32  TRUE
    # 3: 45  TRUE
    
    ## Do a no-copy join of Test to df based on ID
    ## and create a column in df based on the 'Valid' column in Test
    df[Test, Valid := i.Valid, on = "ID"]
    
    ## The final output:
    head(df)
    #    ID       DATE      Month ExpectedMonth Valid
    # 1:  8 2017-01-31 2017-01-01    2017-01-01  TRUE
    # 2:  8 2017-02-28 2017-02-01    2017-02-01  TRUE
    # 3:  8 2017-03-31 2017-03-01    2017-03-01  TRUE
    # 4:  8 2017-04-30 2017-04-01    2017-04-01  TRUE
    # 5:  8 2017-05-31 2017-05-01    2017-05-01  TRUE
    # 6:  8 2017-06-30 2017-06-01    2017-06-01  TRUE
    

    If you really want to, you can use a self-join and skip creating Test, which makes the code more compact.

    setDT(df)
    
    df[,DATE := as.Date(DATE)]
    setkey(df,ID,DATE)
    df[,Month := as.Date(cut.Date(DATE, breaks = "months"))]
    df[,ExpectedMonth := seq.Date(from = min(Month), by = "months", length.out = .N), keyby = .(ID)]
    df[df[Month == ExpectedMonth, .(Valid = .N == 12L), keyby = .(ID)], Valid := i.Valid]
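
    To see how the Month versus ExpectedMonth comparison behaves when a month is actually missing, here is a minimal sketch on made-up data. The table `demo`, the vector `month_ends`, and the column name `FirstMismatch` are assumptions for illustration only and are not part of the original answer. ID 2 deliberately lacks April 2017.

    ```r
    library(data.table)

    ## Hypothetical month-end dates for 2017 (2017-01-31 ... 2017-12-31)
    month_ends <- seq(as.Date("2017-02-01"), by = "month", length.out = 12) - 1

    ## Made-up data: ID 1 is complete, ID 2 is missing April
    demo <- data.table(
      ID   = rep(c(1L, 2L), times = c(12L, 11L)),
      DATE = c(month_ends, month_ends[-4])
    )
    setkey(demo, ID, DATE)

    ## Same logic as above: actual month vs. expected consecutive month
    demo[, Month := as.Date(cut(DATE, breaks = "month"))]
    demo[, ExpectedMonth := seq(min(Month), by = "month", length.out = .N), by = ID]

    ## For each ID with a mismatch, report the first expected month that is absent
    gaps <- demo[Month != ExpectedMonth,
                 .(FirstMismatch = min(ExpectedMonth)), by = ID]
    print(gaps)
    ```

    Under these assumptions only ID 2 appears in `gaps`, with the first mismatch at 2017-04-01, which points at where the jump starts instead of only flagging the group as invalid.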
    
  • 0

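    You can use dplyr's summarise function to return, for each ID, a logical value indicating whether any difference between consecutive dates is greater than 31 days. To do this, first build a temporary date from just the year and month, appending "-01" as a dummy day: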

    library(dplyr)
    library(lubridate)
    
    df %>%
      group_by(ID) %>%
      mutate(DATE2 = ymd(paste0(sub('\\-\\d+$', '', DATE),'-01')),
             DATE_diff = c(0, diff(DATE2))) %>%
      summarise(Valid = !any(DATE_diff > 31))
    

    Result:

    # A tibble: 3 x 2
         ID Valid
      <int> <lgl>
    1     8  TRUE
    2    32  TRUE
    3    45  TRUE
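
    Note that the check above only looks for gaps between dates; it does not confirm that each ID has exactly 12 months. A possible variant that tests both conditions is sketched below. The helper `month_index()` and the data frame `df_demo` are hypothetical names introduced here, and the data is made up (ID 2 is missing April 2017).

    ```r
    library(dplyr)

    # Hypothetical helper: convert a Date to a running month number
    # (year * 12 + month), so consecutive months differ by exactly 1
    month_index <- function(d) {
      lt <- as.POSIXlt(d)
      (lt$year + 1900L) * 12L + lt$mon
    }

    # Made-up month-end dates for 2017
    month_ends <- seq(as.Date("2017-02-01"), by = "month", length.out = 12) - 1

    # Made-up data: ID 1 is complete, ID 2 is missing April
    df_demo <- data.frame(
      ID   = rep(c(1L, 2L), times = c(12L, 11L)),
      DATE = c(month_ends, month_ends[-4])
    )

    check <- df_demo %>%
      group_by(ID) %>%
      arrange(DATE, .by_group = TRUE) %>%
      summarise(n_months    = n(),
                consecutive = all(diff(month_index(DATE)) == 1L),
                Valid       = n_months == 12L & consecutive)

    print(check)
    ```

    With this made-up data, ID 1 should come out as Valid and ID 2 should not, because it has only 11 rows and a two-month step between March and May.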
    

    You can also check visually for gaps by plotting the dates for each ID:

    library(ggplot2)
    
    df %>%
      mutate(DATE = ymd(paste0(sub('\\-\\d+$', '', DATE),'-01')),
             ID = as.factor(ID)) %>%
      ggplot(aes(x = DATE, y = ID, group = ID)) + 
      geom_point(aes(color = ID)) +
      scale_x_date(date_breaks = "1 month",
                   date_labels = "%b-%Y") +
      labs(title = "Time Line by ID")
    

