Calculating a moving maximum in BigQuery

Time: 2014-07-24 00:11:29

Tags: google-bigquery

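Given a BigQuery table with some ordering and some numbers, I'd like to compute a "moving maximum" of the numbers: similar to a moving average, but taking the maximum instead. From "Trying to calculate EMA (exponential moving average) using BigQuery", it seems the best way to do this is to use LEAD() and then do the aggregation myself. ("Bigquery moving average" basically suggests a CROSS JOIN, but that seems like it would be quite slow given the size of the data.)

Ideally, I would return a single repeated field from the inner query rather than 20 separate fields, and then use ordinary aggregation over the repeated field. I haven't figured out a way to do that, so I'm stuck rolling my own aggregation. While that is easy for a sum or an average, computing the max inline is quite tricky, and I haven't found a good way to do it.

(The examples below are, of course, somewhat contrived so that they can use a public dataset. They also take the rolling max over 3 elements, whereas I'd like about 20. I'm already generating the query programmatically, so keeping the query short isn't a major concern.)

One approach is to do the following: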

SELECT word,
  (CASE
    WHEN word_count >= word_count_1 AND word_count >= word_count_2 THEN word_count
    WHEN word_count_1 >= word_count AND word_count_1 >= word_count_2 THEN word_count_1
    ELSE word_count_2 END
    ) AS max_count
FROM (
  SELECT word, word_count,
    LEAD(word_count, 1) OVER (ORDER BY word) AS word_count_1,
    LEAD(word_count, 2) OVER (ORDER BY word) AS word_count_2,
  FROM [publicdata:samples.shakespeare]
  WHERE corpus = 'macbeth'
)

This is O(n^2) in the window size n, but it at least works. I could also do a nested chain of IFs, like this:

SELECT word,
  IF(word_count >= word_count_1,
    IF(word_count >= word_count_2, word_count, word_count_2),
    IF(word_count_1 >= word_count_2, word_count_1, word_count_2)) AS max_count
FROM ...

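This is O(n) to evaluate, but the query size is exponential in n, so I don't think it's a good option; it would certainly exceed the BigQuery query size limit for n = 20. I could also do nested queries: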

SELECT word,
  IF(word_count_2 >= max_count, word_count_2, max_count) AS max_count
FROM (
  SELECT word, word_count_2,
    IF(word_count_1 >= word_count, word_count_1, word_count) AS max_count
  FROM ...
)

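However, doing 20 levels of nested queries seems like it might not be a great idea performance-wise.

Is there a good way to do this kind of query? If not, am I correct that for n around 20, the first approach is the least bad?

3 Answers:

Answer 0: (score: 7)

A trick I use for rolling windows: CROSS JOIN with a table of numbers. In this case, to get a moving window of 3 years, I cross join with the numbers 0, 1, 2. You can then create an id for each group (ending_at_year == year-i) and group by that.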

SELECT ending_at_year, MAX(mean_temp) max_temp, COUNT(DISTINCT year) c
FROM 
(
 SELECT mean_temp, year-i ending_at_year, year
 FROM [publicdata:samples.gsod] a
 CROSS JOIN 
  (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i<3) b
 WHERE station_number=722860
)
GROUP BY ending_at_year
HAVING c=3
ORDER BY ending_at_year;
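
To see why grouping on year-i produces a window, here is a minimal legacy SQL sketch of the same trick on inline data. The toy years and temperatures are made up for illustration, and the numbers 0, 1, 2 are inlined instead of being read from the numbers_255 table. It relies on legacy SQL treating a comma between subqueries as UNION ALL.

```sql
#legacySQL
-- Hypothetical toy data: one temperature per year, 2001 through 2005.
-- The cross join copies each row into the groups year, year-1 and year-2,
-- so the group labelled g collects the rows for years g, g+1 and g+2.
SELECT ending_at_year, MAX(temp) AS max_temp, COUNT(DISTINCT year) AS c
FROM (
  SELECT temp, year - i AS ending_at_year, year
  FROM (
    SELECT year, temp FROM
      (SELECT 2001 AS year, 10 AS temp),
      (SELECT 2002 AS year, 30 AS temp),
      (SELECT 2003 AS year, 20 AS temp),
      (SELECT 2004 AS year, 5 AS temp),
      (SELECT 2005 AS year, 15 AS temp)
  ) a
  CROSS JOIN (
    -- Inline replacement for the numbers table: offsets 0, 1, 2.
    SELECT i FROM (SELECT 0 AS i), (SELECT 1 AS i), (SELECT 2 AS i)
  ) b
)
GROUP BY ending_at_year
-- Drop the incomplete windows at both edges of the data.
HAVING c = 3
ORDER BY ending_at_year;
```

On this toy data, only the groups 2001, 2002 and 2003 contain three distinct years, with maxima 30, 30 and 20. Note that with year-i the group label is the first year of each window.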

Answer 1: (score: 2)

I have another way to do what you are trying to achieve. See the query below.

SELECT word, max(words)
FROM 
  (SELECT word,
    word_count AS words
  FROM [publicdata:samples.shakespeare]
  WHERE corpus = 'macbeth'), 
  (SELECT word,
    LEAD(word_count, 1) OVER (ORDER BY word) AS words
  FROM [publicdata:samples.shakespeare]
  WHERE corpus = 'macbeth'), 
  (SELECT word,
    LEAD(word_count, 2) OVER (ORDER BY word) AS words
  FROM [publicdata:samples.shakespeare]
  WHERE corpus = 'macbeth')
group by word order by word

You can try it and compare its performance with your approach (I haven't tried it).
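
As a point of comparison with the union-and-GROUP BY query, the LEAD() columns from the question can also be reduced row by row with legacy SQL's GREATEST() function. This is only a sketch and has not been benchmarked. The substitution of 0 for missing values is an assumption that holds only because word counts are non-negative.

```sql
#legacySQL
-- Row-wise maximum over the current row and the next two rows (ordered by word).
-- GREATEST() returns NULL if any argument is NULL, and LEAD() yields NULL past
-- the end of the data, so IFNULL substitutes 0 (assumes counts are >= 0).
SELECT word,
  GREATEST(word_count,
           IFNULL(word_count_1, 0),
           IFNULL(word_count_2, 0)) AS max_count
FROM (
  SELECT word, word_count,
    LEAD(word_count, 1) OVER (ORDER BY word) AS word_count_1,
    LEAD(word_count, 2) OVER (ORDER BY word) AS word_count_2
  FROM [publicdata:samples.shakespeare]
  WHERE corpus = 'macbeth'
)
ORDER BY word;
```

For a window of 20, a generated query would need 19 LEAD() columns and one GREATEST() call with 20 arguments, so the query size grows linearly with the window.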

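Answer 2: (score: 0)

The documentation here has an example of computing a moving average using window functions.

Quote:

The following example calculates a moving average of the values in the current row and the row preceding it. The window frame comprises two rows that move with the current row.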

#legacySQL
SELECT
  name,
  value,
  AVG(value)
    OVER (ORDER BY value
          ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
    AS MovingAverage
FROM
  (SELECT "a" AS name, 0 AS value),
  (SELECT "b" AS name, 1 AS value),
  (SELECT "c" AS name, 2 AS value),
  (SELECT "d" AS name, 3 AS value),
  (SELECT "e" AS name, 4 AS value);
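
The same frame syntax can be used with MAX(), which gives a moving maximum directly. The following is a sketch on made-up inline values (not taken from the documentation). Its frame covers the current row and the two rows after it, mirroring the LEAD(word_count, 1) and LEAD(word_count, 2) columns in the question.

```sql
#legacySQL
-- Moving maximum over the current row and the next two rows, ordered by name.
-- The values are hypothetical and deliberately unsorted so the maximum varies.
SELECT
  name,
  value,
  MAX(value)
    OVER (ORDER BY name
          ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
    AS MovingMax
FROM
  (SELECT "a" AS name, 3 AS value),
  (SELECT "b" AS name, 1 AS value),
  (SELECT "c" AS name, 4 AS value),
  (SELECT "d" AS name, 1 AS value),
  (SELECT "e" AS name, 5 AS value);
```

On these values MovingMax should be 4, 4, 5, 5, 5 for rows a through e; the frame simply shrinks at the end of the data. For the question's table, the analogous expression for a window of 20 rows would be MAX(word_count) OVER (ORDER BY word ROWS BETWEEN CURRENT ROW AND 19 FOLLOWING).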