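How can I compute the average duration of the records whose end date falls within the 1 hour before the current record's start date?
I can do this with a self-join: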
SELECT AVG(p.duration) AS prior_duration
FROM `bigquery-public-data`.london_bicycles.cycle_hire c
JOIN `bigquery-public-data`.london_bicycles.cycle_hire p
ON c.start_station_id = p.start_station_id AND
p.end_date BETWEEN TIMESTAMP_SUB(c.start_date, INTERVAL 3600 SECOND)
AND c.start_date
But how can I do this more efficiently (without a self-join)? Something like:
AVG(duration)
OVER(PARTITION BY start_station_id
ORDER BY UNIX_SECONDS(end_date) ASC
RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW) AS prior_duration
but using the current record's start date.
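Answer 0: (score: 1)
Update: see Mikhail's comment. This does not work. I have updated the query so that BigQuery does not apply its quick optimization.
Here is an exact solution. The idea is to build an array of all the records at a station and, with a correlated subquery, keep only those that ended within the hour before each hire started. Processing the entire dataset took 7 seconds.
The array of records for a station has to stay under 100 MB, though. Group by additional fields as needed to make the arrays small enough :)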
WITH all_hires AS (
SELECT
start_station_id
, ARRAY_AGG(STRUCT(duration,
start_date,
TIMESTAMP_SUB(start_date, INTERVAL 1 HOUR) AS start_date_m1h,
end_date)) AS hires
FROM `bigquery-public-data`.london_bicycles.cycle_hire
GROUP BY start_station_id
),
hires_by_ts AS (
SELECT
start_station_id
, h.start_date
, (SELECT AVG(duration) FROM UNNEST(hires)
WHERE end_date BETWEEN h.start_date_m1h AND h.start_date)
AS duration_prev_hour
, (SELECT COUNT(duration) FROM UNNEST(hires)
WHERE end_date BETWEEN h.start_date_m1h AND h.start_date)
AS numreturns_prev_hour
FROM
all_hires, UNNEST(hires) AS h
)
SELECT * FROM hires_by_ts
WHERE duration_prev_hour IS NOT NULL
ORDER BY duration_prev_hour DESC
LIMIT 5
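To see what the correlated subquery returns without scanning the public table, the same pattern can be tried on a few inline rows. This is only a minimal sketch: the `sample` CTE and its three rows are made up for illustration and are not taken from the dataset.

```sql
-- Hypothetical sample rows (not from the public dataset): one station, three hires.
WITH sample AS (
  SELECT 1 AS start_station_id, 600 AS duration,
         TIMESTAMP '2020-01-01 09:00:00' AS start_date,
         TIMESTAMP '2020-01-01 09:10:00' AS end_date
  UNION ALL
  SELECT 1, 1200, TIMESTAMP '2020-01-01 09:20:00', TIMESTAMP '2020-01-01 09:40:00'
  UNION ALL
  SELECT 1, 300, TIMESTAMP '2020-01-01 10:00:00', TIMESTAMP '2020-01-01 10:05:00'
),
all_hires AS (
  -- One array of hires per station, as in the query above.
  SELECT start_station_id,
         ARRAY_AGG(STRUCT(duration, start_date, end_date)) AS hires
  FROM sample
  GROUP BY start_station_id
)
SELECT
  a.start_station_id,
  h.start_date,
  -- Average over the hires that ended within the hour before this hire started.
  (SELECT AVG(x.duration)
   FROM UNNEST(a.hires) AS x
   WHERE x.end_date BETWEEN TIMESTAMP_SUB(h.start_date, INTERVAL 1 HOUR)
                        AND h.start_date) AS duration_prev_hour
FROM all_hires AS a, UNNEST(a.hires) AS h
ORDER BY h.start_date
```

On these three rows I would expect NULL for the 09:00 hire (nothing ended in the preceding hour), 600.0 for the 09:20 hire, and 900.0 for the 10:00 hire, which averages the 600 and 1200 second trips.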
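Answer 1: (score: 1)
Given that you cannot use different fields in the ordering and in the window frame boundaries, the only way I can think of is a double pass. Be aware that you may lose some rows this way, though: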
WITH cycle_hires AS (
SELECT
start_station_id,
start_date,
ARRAY_AGG(STRUCT(end_date, duration)) OVER (
PARTITION BY start_station_id
ORDER BY end_date ASC
ROWS BETWEEN 100 PRECEDING AND CURRENT ROW
) AS previous
FROM `bigquery-public-data`.london_bicycles.cycle_hire AS c
)
SELECT
c.start_station_id,
AVG(p.duration) AS previous_duration,
COUNT(*) AS number_of_previous_trips_used
FROM cycle_hires AS c
JOIN UNNEST(previous) AS p
WHERE p.end_date BETWEEN TIMESTAMP_SUB(c.start_date, INTERVAL 3600 SECOND) AND c.start_date
GROUP BY 1
With this dataset (about 24 million rows), using up to 100 preceding rows takes about 20 seconds, while using 1000 preceding rows takes about 120 seconds.
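Rows can be lost because the frame is capped at a fixed number of rows: at a busy station, more than 100 hires may sit between the start of the one-hour window and the current row in `end_date` order. One rough way to estimate how often that happens is to check whether the frame is full and its oldest `end_date` is still inside the one-hour window. The query below is a sketch under that assumption, not something I have run against the full table; the column names `oldest_end_in_frame`, `frame_size` and `possibly_truncated` are made up for illustration.

```sql
WITH windowed AS (
  SELECT
    start_date,
    -- Oldest end_date among the current row and its 100 predecessors.
    MIN(end_date) OVER (
      PARTITION BY start_station_id
      ORDER BY end_date ASC
      ROWS BETWEEN 100 PRECEDING AND CURRENT ROW
    ) AS oldest_end_in_frame,
    -- 101 means the frame is full, so earlier rows may have been cut off.
    COUNT(*) OVER (
      PARTITION BY start_station_id
      ORDER BY end_date ASC
      ROWS BETWEEN 100 PRECEDING AND CURRENT ROW
    ) AS frame_size
  FROM `bigquery-public-data`.london_bicycles.cycle_hire
)
SELECT
  COUNT(*) AS total_rows,
  -- Upper bound: a full frame whose oldest row still lies inside the hour.
  COUNTIF(frame_size = 101
          AND oldest_end_in_frame >= TIMESTAMP_SUB(start_date, INTERVAL 1 HOUR))
    AS possibly_truncated
FROM windowed
```

If `possibly_truncated` is a noticeable share of `total_rows`, the frame size is probably too small for this data, and a larger frame trades run time for completeness, as the timings above suggest.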