首页 文章

Google BigQuery标准SQL - 销售滑动窗口

提问于
浏览
0

我想知道如何最好地在BigQuery standardSQL (不是遗产)中做一个翻滚和滑动窗口 .

Apache Spark让这很容易 - 对于一个长度为10分钟的滑动窗口,每5分钟滑动一次,groupBy(窗口(“10分钟”,“5分钟”))和翻滚窗口组(窗口(“10分钟”)) ) .

考虑我有一个简化的订单行:

orderId, 
orderPlacedTimestamp, 
orderTotals.grandTotalNet, 
orderTotals.grandTotalGross

(注意总数上的结构)

我需要两个总的订单总销售额:

  • 每时每刻都在翻滚销售窗口

  • 滑动10分钟/ 5分钟窗口(如上所述) .

我开始使用SQL over 但是在分区,时间戳和获得正确的开始和结束窗口方面有点混乱 . 对于每个结果集应该是:

windowStartTime, 
windowEndTime, 
windowTotalAmount

因此窗口具有开始和结束时间(这对于滑动是至关重要的,因为窗口10的长度与滑动持续时间5不同,因此给定的行可以在多个窗口中作为结果) .

我如何在BigQuery中执行此操作?

使用当前SQL更新11/06:

SELECT 
  TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min*5*60) OVER(w)) as startWindowTime,
  TIMESTAMP_SECONDS(LAST_VALUE(ts_5min*5*60) OVER(w)) as endWindowTime,
  SUM(orderTotalNetConverted) OVER(ORDER BY ts_5min RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) as windowSalesTotal
FROM (
  SELECT 
    CAST(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE))/60/5 AS INT64)
      AS ts_5min,
    orderTotalNetConverted
  FROM orders
)
WINDOW w AS (ORDER BY ts_5min RANGE BETWEEN 1 PRECEDING AND CURRENT ROW)
ORDER BY startWindowTime desc

2 回答

  • 1

    翻滚窗口更容易做,它只是每小时一次的常规 GROUP BY

    SELECT 
      TIMESTAMP_TRUNC(orderPlacedTimestamp, HOUR), 
      SUM(orderTotals.grandTotalNet)
    FROM T
    GROUP BY 1
    

    对于滑动窗口,我首先使用以下内容将时间戳标准化为5分钟间隔:

    • TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE) 进入分界线

    • UNIX_SECONDS 从纪元转换为秒

    • 除以60得到分钟

    • 除以5得到5分钟的间隔

    • 舍入为整数:

    CAST(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE))/60/5 AS INT64)

    现在你可以使用标准的 OVER() 子句获得10分钟的窗口,这意味着每次有2个这样的间隔,并且为了获得开始时间,使用 FIRST_VALUE 解析函数:

    SELECT
      orderId,
      TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min*5*60) OVER(w)) startWindowTime,
      TIMESTAMP_ADD(TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min*5*60) OVER(w)),
                    INTERVAL 10 MINUTE) endWindowTime,
      SUM(grandTotalNet) OVER(w)
    FROM (
      SELECT 
      *,
      CAST(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE))/60/5 AS INT64) 
        AS ts_5min
      FROM t
    )
    WINDOW w AS (ORDER BY ts_5min RANGE BETWEEN 1 PRECEDING AND CURRENT ROW)
    
  • 1

    以下是 sliding 10 MINUTE / 5 MINUTE window

    #standardSQL
    WITH starts AS (
      SELECT TIMESTAMP_ADD(TIMESTAMP_TRUNC(first, HOUR), INTERVAL step MINUTE) AS start
      FROM 
        (SELECT MIN(orderPlacedTimestamp) AS first, MAX(orderPlacedTimestamp) AS last FROM YourTable), 
        UNNEST(GENERATE_ARRAY(0, TIMESTAMP_DIFF(last, TIMESTAMP_TRUNC(first, HOUR), MINUTE) , 5)) AS step
    )
    SELECT 
      start, 
      SUM(orderTotals.grandTotalNet) AS net, 
      SUM(orderTotals.grandTotalGross) AS gross
    FROM starts AS s JOIN YourTable AS t
    ON t.orderPlacedTimestamp BETWEEN s.start AND TIMESTAMP_ADD(start, INTERVAL 10 MINUTE)
    GROUP BY start
    ORDER BY start
    

    正如Mosha在他的回答中提到的 - 翻滚HOURLY窗口很容易 - 但如果你需要不同的窗口 - 上面的方法更灵活(我认为)并且易于调整任何窗口大小,通过更改查询代码中的相应 510 来键入
    但总的来说 - 想提一下 - 使用分析函数比加入更优化

相关问题