Window analytic function where the window frame and the ordering use different fields?

Time: 2019-05-15 12:48:35

Tags: sql google-bigquery

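How can I compute the average duration of the records whose end date falls within the 1 hour before this record's start date?

I can do this with a self-join: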

  SELECT AVG(p.duration) AS prior_duration
  FROM `bigquery-public-data`.london_bicycles.cycle_hire c
  JOIN `bigquery-public-data`.london_bicycles.cycle_hire p
  ON c.start_station_id = p.start_station_id AND
     p.end_date BETWEEN TIMESTAMP_SUB(c.start_date, INTERVAL 3600 SECOND)
                  AND c.start_date

But how can I do this more efficiently (without a self-join)? Something like:

AVG(duration)
         OVER(PARTITION BY start_station_id
         ORDER BY UNIX_SECONDS(end_date) ASC 
         RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW) AS prior_duration

but using the start date of the current record.
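
To pin down the quantity being asked for, here is a small reference sketch in Python rather than BigQuery SQL (the swap is only so the logic can be run locally on a sample). It assumes each hire is a tuple `(station_id, start_ts, end_ts, duration)` with timestamps in Unix seconds; the function name `prior_hour_avg` and the tuple layout are hypothetical and not part of the dataset. Per station it sorts the end times once, builds prefix sums of the durations, and runs two binary searches per record. That is the kind of evaluation the window clause above cannot express, because the frame is ordered by `end_date` while the value being probed is `start_date`.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from itertools import accumulate


def prior_hour_avg(hires, window=3600):
    """For each hire, average the durations of hires at the same station
    whose end time lies in [start - window, start] (both bounds inclusive,
    like SQL BETWEEN). Returns a list aligned with `hires`; an entry is
    None when no hire qualifies."""
    by_station = defaultdict(list)
    for station, _start, end, duration in hires:
        by_station[station].append((end, duration))

    index = {}
    for station, rows in by_station.items():
        rows.sort()  # order by end time
        ends = [end for end, _ in rows]
        # prefix[i] = sum of the first i durations in end-time order
        prefix = [0] + list(accumulate(d for _, d in rows))
        index[station] = (ends, prefix)

    result = []
    for station, start, _end, _duration in hires:
        ends, prefix = index[station]
        lo = bisect_left(ends, start - window)  # first end >= start - window
        hi = bisect_right(ends, start)          # one past the last end <= start
        count = hi - lo
        result.append((prefix[hi] - prefix[lo]) / count if count else None)
    return result
```

A sketch like this can serve as a check on a small extract when comparing the output of the SQL variants; it is not a replacement for running the query in BigQuery.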

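2 answers:

Answer 0: (score: 1)

Update: see Mikhail's comment. This does not work. I have updated the query to keep BigQuery from applying its quick optimization.

This is an exact solution. The idea is to build an array of all the records at a station and to filter it to the preceding hour with a correlated subquery. Processing the entire dataset took 7 seconds.

The array of records for a station must be smaller than 100 MB, though. Group by additional fields as needed to keep the array small enough :)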

WITH all_hires AS (
  SELECT 
    start_station_id
    , ARRAY_AGG(STRUCT(duration, 
                       start_date, 
                       TIMESTAMP_SUB(start_date, INTERVAL 1 HOUR) AS start_date_m1h, 
                       end_date)) AS hires
  FROM `bigquery-public-data`.london_bicycles.cycle_hire
  GROUP BY start_station_id
),

hires_by_ts AS (
  SELECT
    start_station_id
    , h.start_date
    , (SELECT AVG(duration) FROM UNNEST(hires) 
       WHERE end_date BETWEEN h.start_date_m1h AND h.start_date)
         AS duration_prev_hour
    , (SELECT COUNT(duration) FROM UNNEST(hires) 
       WHERE end_date BETWEEN h.start_date_m1h AND h.start_date)
         AS numreturns_prev_hour
  FROM
    all_hires, UNNEST(hires) AS h
)

SELECT * FROM hires_by_ts
WHERE duration_prev_hour IS NOT NULL
ORDER BY duration_prev_hour DESC
LIMIT 5

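Answer 1: (score: 1)

Given that you cannot use different fields for the ordering and for the window frame boundaries, the only approach I can think of is to do it in two passes, with the caveat that you may (or will) miss some rows: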

WITH cycle_hires AS (
  SELECT 
    start_station_id,
    start_date,
    ARRAY_AGG(STRUCT(end_date, duration)) OVER (
      PARTITION BY start_station_id
      ORDER BY end_date ASC
      ROWS BETWEEN 100 PRECEDING AND CURRENT ROW
    ) AS previous
  FROM `bigquery-public-data`.london_bicycles.cycle_hire AS c
)
SELECT
  c.start_station_id,
  AVG(p.duration) AS previous_duration,
  COUNT(*) AS number_of_previous_trips_used
FROM cycle_hires AS c
  JOIN UNNEST(previous) AS p
  WHERE p.end_date BETWEEN TIMESTAMP_SUB(c.start_date, INTERVAL 3600 SECOND) AND c.start_date
GROUP BY 1

With this dataset (about 24 million rows), using up to 100 preceding rows takes about 20 seconds, and using 1,000 preceding rows takes about 120 seconds.
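
The reason a fixed row frame can miss rows is that the frame is ordered by `end_date`. Hires that ended after the current record's `start_date` but before its own `end_date` also occupy frame slots, and they can push the qualifying rows out of the frame. The following is a minimal Python sketch (not BigQuery SQL) that emulates `ROWS BETWEEN k PRECEDING AND CURRENT ROW` for a single station and compares the number of qualifying rows found inside the frame with the exact number. The function name `qualifying_counts` and the toy timestamps in the usage note are hypothetical.

```python
def qualifying_counts(hires, k, window=3600):
    """hires: list of (start_ts, end_ts) for ONE station, in Unix seconds.
    Returns one (seen_in_frame, exact) pair per hire, in end-time order.
    The frame emulates ROWS BETWEEN k PRECEDING AND CURRENT ROW."""
    rows = sorted(hires, key=lambda h: h[1])  # ORDER BY end_date ASC

    def ended_in_prior_hour(row, start):
        return start - window <= row[1] <= start

    out = []
    for i, (start, _end) in enumerate(rows):
        frame = rows[max(0, i - k): i + 1]  # k preceding rows plus current row
        seen = sum(1 for r in frame if ended_in_prior_hour(r, start))
        exact = sum(1 for r in rows if ended_in_prior_hour(r, start))
        out.append((seen, exact))
    return out
```

For example, with the toy hires `[(0, 100), (50, 150), (200, 5000), (300, 400), (500, 600)]` and `k=2`, the long hire that starts at 200 and ends at 5000 sees none of its two qualifying rows, because the two shorter hires that ended at 400 and 600 fill its frame. Raising `k` to 4 recovers them, which mirrors the accuracy versus runtime trade-off between 100 and 1,000 preceding rows.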