首页 文章

在bigquery中滑动窗口聚合一年

提问于
浏览
2

我的问题是关于滑动窗口总结在bigquery .

我有一张如下表

run_id      year_week     value
  001        201451         5
  001        201452         8
  001        201501         1
  001        201505         5
  003        201352         8
  003        201401         1
  003        201405         5

每周这一周的时间范围从01到53.例如,2014年的上一周是201452,但2015年的上周是201553 . 如果它让生活更轻松我只有5年,2013年,2014年,2015年, 2016年和2017年,只有2015年有几周达到53 .

现在,对于每次运行,我试图获得值的滑动窗口总和 . 每个 year_week 将假定当前run_id(例如 001 )的下一个值_ year_week (包括其自身)的总和 . 例如,以下可以是当前表的输出

run_id    year_week   aggregate_sum
  001      201451      5+8+1+0+0
  001      201452      8+1+0+0+0
  001      201501      1+0+0+0+5
  001      201502      0+0+0+5+0
  001      201503      0+0+5+0+0
  001      201504      0+5+0+0+0
  001      201505      5+0+0+0+0
  003      201352      8+1+0+0+0
  003      201401      1+0+0+0+5
  003      201402      0+0+0+5+0
  003      201403      0+0+5+0+0
  003      201404      0+5+0+0+0
  003      201405      5+0+0+0+0

为了解释发生的情况,201451的未来5周包括其自身将是201451,201452,201501,201502,201503 . 如果当前 run_id 表中有这些周的值,我们只需将它们相加,即5 8 1 0 0,因为 year_week 的相应值如果不在表中则为0 .

是否可以使用bigquery中的滑动窗口操作来实现?

1 回答

  • 2

    以下是BigQuery Standard SQL

    #standardSQL
    WITH weeks AS (
      SELECT 100* year + week year_week
      FROM UNNEST([2013, 2014, 2015, 2016, 2017]) year, 
      UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1+year,1,1)) = 1, 52, 53))) week
    ), temp AS (
      SELECT i.run_id, w.year_week, d.year_week week2, value
      FROM  weeks w 
      CROSS JOIN (SELECT DISTINCT run_id FROM `project.dataset.table`) i
      LEFT JOIN `project.dataset.table` d
      USING(year_week, run_id)
    )
    SELECT * FROM (
      SELECT run_id, year_week, 
        SUM(value) OVER(win) aggregate_sum
      FROM temp
      WINDOW win AS (
        PARTITION BY run_id ORDER BY year_week ROWS BETWEEN CURRENT row AND 4 FOLLOWING
      )
    )
    WHERE NOT aggregate_sum IS NULL
    

    您可以使用您问题中的虚拟数据进行上述测试/播放,如下所示

    #standardSQL
    WITH `project.dataset.table` AS (
      SELECT '001' run_id, 201451 year_week, 5 value UNION ALL
      SELECT '001', 201452, 8 UNION ALL
      SELECT '001', 201501, 1 UNION ALL
      SELECT '001', 201505, 5 
    ), weeks AS (
      SELECT 100* year + week year_week
      FROM UNNEST([2013, 2014, 2015, 2016, 2017]) year, 
      UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1+year,1,1)) = 1, 52, 53))) week
    ), temp AS (
      SELECT i.run_id, w.year_week, d.year_week week2, value
      FROM  weeks w 
      CROSS JOIN (SELECT DISTINCT run_id FROM `project.dataset.table`) i
      LEFT JOIN `project.dataset.table` d
      USING(year_week, run_id)
    )
    SELECT * FROM (
      SELECT run_id, year_week, 
        SUM(value) OVER(win) aggregate_sum
      FROM temp
      WINDOW win AS (
        PARTITION BY run_id ORDER BY year_week ROWS BETWEEN CURRENT row AND 4 FOLLOWING
      )
    )
    WHERE NOT aggregate_sum IS NULL
    -- ORDER BY run_id, year_week
    

    结果为

    Row run_id  year_week   aggregate_sum    
    1   001     201447      5    
    2   001     201448      13   
    3   001     201449      14   
    4   001     201450      14   
    5   001     201451      14   
    6   001     201452      9    
    7   001     201501      6    
    8   001     201502      5        
    9   001     201503      5    
    10  001     201504      5    
    11  001     201505      5
    12  003     201348      8    
    13  003     201349      9    
    14  003     201350      9    
    15  003     201351      9    
    16  003     201352      9    
    17  003     201401      6    
    18  003     201402      5    
    19  003     201403      5    
    20  003     201404      5    
    21  003     201405      5
    

    注意;这是为了 I only have 5 years, 2013, 2014, 2015, 2016 and 2017 ,但可以在几周的CTE中轻松延长

相关问题