Is there a way to change this BigQuery self-join to use window functions?

Time: 2019-07-09 12:26:51

Tags: sql performance google-bigquery window-functions self-join

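Suppose I have a BigQuery table "events" (in reality this is a slow subquery) that stores the count of events per day, by event type. There are many event types and most of them do not occur on most days, so a row exists only for day/event type combinations with a non-zero count.

I have a query that returns the count for each event type and day, along with the count for that event type N days earlier (N = 2 in this example). It looks like this: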

WITH events AS (
  SELECT DATE('2019-06-08') AS day, 'a' AS type, 1 AS count
  UNION ALL SELECT '2019-06-09', 'a', 2
  UNION ALL SELECT '2019-06-10', 'a', 3
  UNION ALL SELECT '2019-06-07', 'b', 4
  UNION ALL SELECT '2019-06-09', 'b', 5
)
SELECT e1.type, e1.day, e1.count, COALESCE(e2.count, 0) AS prev_count
FROM events e1
LEFT JOIN events e2 ON e1.type = e2.type AND e1.day = DATE_ADD(e2.day, INTERVAL 2 DAY) -- LEFT JOIN, because the event may not have occurred at all 2 days ago
ORDER BY 1, 2

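The query is slow. BigQuery best practices recommend using window functions instead of self-joins. Is there a way to do that here? I could use the LAG function if there were a row for every day, but there isn't. Can I "pad" the data somehow? (There is no short list of possible event types. I could of course join against SELECT DISTINCT type FROM events, but that probably wouldn't be any faster than the self-join.)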

2 answers:

Answer 0: (score: 2)

A brute-force approach is:

select e.*,
       -- the row from 2 days ago is either the previous row or the one before it
       (case when lag(e.day) over (partition by type order by e.day) = date_add(e.day, interval -2 day)
             then lag(e.count) over (partition by type order by e.day)
             when lag(e.day, 2) over (partition by type order by e.day) = date_add(e.day, interval -2 day)
             then lag(e.count, 2) over (partition by type order by e.day)
        end) as prev_day2_count
from events e;

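This works fine for a lag of two days. For longer lags it becomes more cumbersome, because each extra day needs another pair of LAG branches.

EDIT:

A more general form uses window frames. Unfortunately, a RANGE frame with an offset must be ordered by a number, so an extra step is needed to turn the date into a day number: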

select e.*,
       (case when min(e.day) over (partition by type order by diff range between 2 preceding and current row) = date_add(e.day, interval -2 day)
             then first_value(e.count) over (partition by type order by diff range between 2 preceding and current row)
        end) as prev_day2_count
from (select e.*,
             date_diff(day, max(day) over (partition by type), day) as diff   -- day is a bad name for a column because it is a date part
      from events e
     ) e;

And then, duh! The case expression isn't necessary:

select e.*,
       -- the frame holds only the row exactly 2 days back, so the result is NULL when there is none
       first_value(e.count) over (partition by type order by diff range between 2 preceding and 2 preceding) as prev_day2_count
from (select e.*,
             date_diff(day, max(day) over (partition by type), day) as diff   -- day is a bad name for a column because it is a date part
      from events e
     ) e;
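
For reference, here is a minimal sketch of that last form applied to the sample data from the question, with COALESCE added so that missing days yield 0 as in the original self-join. Two things here are assumptions rather than part of the answer: the day number is computed with UNIX_DATE (days since 1970-01-01) instead of DATE_DIFF against max(day), and the CTE name numbered and the alias day_number are made up for illustration.

```sql
WITH events AS (
  SELECT DATE '2019-06-08' AS day, 'a' AS type, 1 AS count
  UNION ALL SELECT DATE '2019-06-09', 'a', 2
  UNION ALL SELECT DATE '2019-06-10', 'a', 3
  UNION ALL SELECT DATE '2019-06-07', 'b', 4
  UNION ALL SELECT DATE '2019-06-09', 'b', 5
),
numbered AS (  -- hypothetical helper CTE: adds an integer day number per row
  SELECT e.*,
         UNIX_DATE(e.day) AS day_number  -- assumption: replaces the answer's DATE_DIFF step
  FROM events e
)
SELECT n.type,
       n.day,
       n.count,
       -- The frame contains only rows whose day_number is exactly 2 less than
       -- the current row's; an empty frame gives NULL, which COALESCE turns into 0.
       COALESCE(
         FIRST_VALUE(n.count) OVER (
           PARTITION BY n.type
           ORDER BY n.day_number
           RANGE BETWEEN 2 PRECEDING AND 2 PRECEDING
         ),
         0
       ) AS prev_count
FROM numbered n
ORDER BY n.type, n.day;
```

On the sample data, only two rows should get a non-zero prev_count: type a on 2019-06-10 (1, from 2019-06-08) and type b on 2019-06-09 (4, from 2019-06-07). To use a different lag, change both offsets in the RANGE clause to N. Whether this is actually faster than the self-join depends on the real data and would need to be measured.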

Answer 1: (score: 1)

Below is for BigQuery Standard SQL


If applied to the sample data from your question, the result is:
