首页 文章

计算BigQuery中的移动最大值

提问于
浏览
5

给定一个带有一些排序的BigQuery表和一些数字,我想计算一个"moving maximum"的数字 - 类似于移动平均线,但是最大值 . 从Trying to calculate EMA (exponential moving average) using BigQuery看来,最好的方法是使用 LEAD() 然后自己进行聚合 . (Bigquery moving average本质上建议 CROSS JOIN ,但考虑到数据的大小,这似乎会很慢 . )

理想情况下,我可能只能从内部查询中返回一个重复的字段,而不是20个单独的字段,然后在重复字段上使用常规聚合,但我还没有找到一种方法来做到这一点,所以我我坚持滚动自己的聚合 . 虽然这对于求和或平均来说很容易,但计算最大内联是非常棘手的,我还没有想出一个好方法 .

(下面的例子当然是为了使用公共数据集而设计的 . 它们最多滚动3个元素,而我想在20左右完成 . 我已经以编程方式生成查询,所以制作它简短不是一个大问题 . )

一种方法是执行以下操作:

SELECT word,
  (CASE
    WHEN word_count >= word_count_1 AND word_count >= word_count_2 THEN word_count
    WHEN word_count_1 >= word_count AND word_count_1 >= word_count_2 THEN word_count_1
    ELSE word_count_2 END
    ) AS max_count
FROM (
  SELECT word, word_count,
    LEAD(word_count, 1) OVER (ORDER BY word) AS word_count_1,
    LEAD(word_count, 2) OVER (ORDER BY word) AS word_count_2,
  FROM [publicdata:samples.shakespeare]
  WHERE corpus = 'macbeth'
)

这是O(n ^ 2),但它至少有效 . 我也可以做一个 IF 的嵌套链,如下所示:

SELECT word,
  IF(word_count >= word_count_1,
    IF(word_count >= word_count_2, word_count, word_count_2),
    IF(word_count_1 >= word_count_2, word_count_1, word_count_2)) AS max_count
FROM ...

这是要评估的O(n),但是查询大小在n中是指数的,所以我认为这不是一个好的选择;当然它会超过n = 20的BigQuery查询大小限制 . 我也可以做嵌套查询:

SELECT word,
  IF(word_count_2 >= max_count, word_count_2, max_count) AS max_count
FROM (
  SELECT word,
    IF(word_count_1 >= word_count, word_count_1, word_count) AS max_count
  FROM ...
)

不过,似乎做20个嵌套查询可能不是一个好主意 .

有没有办法做这种查询?如果没有,我是否更正为20左右,第一个是最不好的?

3 回答

  • 7

    我用来滚动窗口的技巧:CROSS JOIN和一个数字表 . 在这种情况下,要有一个3年的移动窗口,我用数字0,1,2交叉连接 . 然后,您可以为每个组创建一个id( ending_at_year == year-i )并按此分组 .

    SELECT ending_at_year, MAX(mean_temp) max_temp, COUNT(DISTINCT year) c
    FROM 
    (
     SELECT mean_temp, year-i ending_at_year, year
     FROM [publicdata:samples.gsod] a
     CROSS JOIN 
      (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i<3) b
     WHERE station_number=722860
    )
    GROUP BY ending_at_year
    HAVING c=3
    ORDER BY ending_at_year;
    
  • 2

    我有另一种方法来做你想要实现的事情 . 见下面的查询

    SELECT word, max(words)
    FROM 
      (SELECT word,
        word_count AS words
      FROM [publicdata:samples.shakespeare]
      WHERE corpus = 'macbeth'), 
      (SELECT word,
        LEAD(word_count, 1) OVER (ORDER BY word) AS words
      FROM [publicdata:samples.shakespeare]
      WHERE corpus = 'macbeth'), 
      (SELECT word,
        LEAD(word_count, 2) OVER (ORDER BY word) AS words
      FROM [publicdata:samples.shakespeare]
      WHERE corpus = 'macbeth')
    group by word order by word
    

    您可以尝试并将性能与您的方法进行比较(我没试过)

  • 0

    在docs here中有一个创建移动使用窗口函数的示例 .

    引用:

    以下示例计算当前行和其前一行中值的移动平均值 . 窗口框架包括与当前行一起移动的两行 .

    #legacySQL
    SELECT
      name,
      value,
      AVG(value)
        OVER (ORDER BY value
              ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
        AS MovingAverage
    FROM
      (SELECT "a" AS name, 0 AS value),
      (SELECT "b" AS name, 1 AS value),
      (SELECT "c" AS name, 2 AS value),
      (SELECT "d" AS name, 3 AS value),
      (SELECT "e" AS name, 4 AS value);
    

相关问题