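I am trying to solve a problem where I want to merge overlapping intervals for a given id column, but I also want to keep track of the maximum value within each merged interval. Each interval has a start_time and a stop_time, and each interval has a hierarchy/priority associated with it.
These are the columns in the table: id, start_time, stop_time, some_value
Sample input:
Sample output: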
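Answer 0 (score: 1)
Below is code for BigQuery Standard SQL. I assume you are still working on the same use case as in your previous question, so I wanted to keep this consistent with that solution. It can be extended if priority needs to be taken into account.
So, anyway: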
#standardSQL
WITH check_times AS (
  -- every distinct interval boundary per id
  SELECT id, start_time AS TIME FROM `project.dataset.table` UNION DISTINCT
  SELECT id, stop_time AS TIME FROM `project.dataset.table`
), distinct_intervals AS (
  -- elementary intervals between consecutive boundaries
  SELECT id, TIME AS start_time, LEAD(TIME) OVER(PARTITION BY id ORDER BY TIME) stop_time
  FROM check_times
), deduped_intervals AS (
  -- keep elementary intervals fully covered by at least one source row, with the max value
  SELECT a.id, a.start_time, a.stop_time, MAX(some_value) some_value
  FROM distinct_intervals a
  JOIN `project.dataset.table` b
  ON a.id = b.id
  AND a.start_time BETWEEN b.start_time AND b.stop_time
  AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), combined_intervals AS (
  -- glue adjacent elementary intervals; flag is TRUE where a gap starts a new group
  SELECT id, MIN(start_time) start_time, MAX(stop_time) stop_time, MAX(some_value) some_value
  FROM (
    SELECT id, start_time, stop_time, some_value, COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) grp
    FROM (
      SELECT id, start_time, stop_time, some_value,
        start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) flag
      FROM deduped_intervals
    )
  )
  GROUP BY id, grp
)
SELECT *
FROM combined_intervals
-- ORDER BY id, start_time
If applied to your sample data, the result is:
Row  id  start_time  stop_time  some_value
1    1   0           36         50
2    1   41          47         23
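To make the stages of the query above easier to follow, here is a minimal Python sketch of the same pipeline: collect boundaries, build elementary intervals, take the maximum per covered interval, then glue adjacent pieces. It is written in plain Python rather than BigQuery SQL so it can be run locally. The function name `merge_with_max` and the rows are made up for illustration; they are not the asker's sample data.

```python
from itertools import groupby


def merge_with_max(rows):
    """rows: iterable of (id, start_time, stop_time, some_value) tuples."""
    out = []
    for key, grp in groupby(sorted(rows), key=lambda r: r[0]):
        events = list(grp)
        # check_times: every distinct boundary for this id
        times = sorted({t for _, s, e, _ in events for t in (s, e)})
        # distinct_intervals + deduped_intervals: keep each elementary
        # interval covered by at least one event, with the max value
        pieces = []
        for lo, hi in zip(times, times[1:]):
            vals = [v for _, s, e, v in events if s <= lo and hi <= e]
            if vals:
                pieces.append((lo, hi, max(vals)))
        # combined_intervals: glue pieces whose start equals the previous stop
        merged = []
        for lo, hi, v in pieces:
            if merged and merged[-1][1] == lo:
                merged[-1] = (merged[-1][0], hi, max(merged[-1][2], v))
            else:
                merged.append((lo, hi, v))
        out.extend((key, lo, hi, v) for lo, hi, v in merged)
    return out


# Hypothetical rows: (id, start_time, stop_time, some_value)
rows = [(1, 0, 10, 5), (1, 5, 20, 50), (1, 30, 40, 7), (2, 3, 8, 2)]
print(merge_with_max(rows))
# [(1, 0, 20, 50), (1, 30, 40, 7), (2, 3, 8, 2)]
```

The gap between 20 and 30 for id 1 is not covered by any row, so it is dropped in the second stage and the two runs stay separate.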
Is it possible to add one more column to the result showing the number of events in that time period?
#standardSQL
WITH check_times AS (
  SELECT id, start_time AS TIME FROM `project.dataset.table` UNION DISTINCT
  SELECT id, stop_time AS TIME FROM `project.dataset.table`
), distinct_intervals AS (
  SELECT id, TIME AS start_time, LEAD(TIME) OVER(PARTITION BY id ORDER BY TIME) stop_time
  FROM check_times
), deduped_intervals AS (
  -- event_hash: ANY_VALUE keeps one arbitrary matching source row per elementary interval
  SELECT a.id, a.start_time, a.stop_time, MAX(some_value) some_value, ANY_VALUE(TO_JSON_STRING(b)) event_hash
  FROM distinct_intervals a
  JOIN `project.dataset.table` b
  ON a.id = b.id
  AND a.start_time BETWEEN b.start_time AND b.stop_time
  AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), combined_intervals AS (
  -- events: number of distinct source rows picked across the merged group
  SELECT id, MIN(start_time) start_time, MAX(stop_time) stop_time, MAX(some_value) some_value, COUNT(DISTINCT event_hash) events
  FROM (
    SELECT *, COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) grp
    FROM (
      SELECT *,
        start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) flag
      FROM deduped_intervals
    )
  )
  GROUP BY id, grp
)
SELECT *
FROM combined_intervals
-- ORDER BY id, start_time
with the result:
Row  id  start_time  stop_time  some_value  events
1    1   0           36         50          8
2    1   41          47         23          1
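As a way to cross-check an events column, the following Python sketch counts the distinct source rows that fall into each merged interval. Note that this is a different technique from the query above: it sweeps the rows in start order instead of splitting them into elementary intervals, and it counts every distinct row directly. The name `merge_with_count` and the rows are hypothetical.

```python
def merge_with_count(rows):
    """rows: iterable of (id, start_time, stop_time, some_value) tuples.

    Returns (id, start_time, stop_time, max_value, events) per merged interval.
    """
    out = []
    cur = None  # [id, start, stop, max value, event count]
    # set() drops exact duplicate rows so each distinct event counts once
    for key, s, e, v in sorted(set(rows)):
        if cur is not None and cur[0] == key and s <= cur[2]:
            # overlaps or touches the current interval: extend it
            cur[2] = max(cur[2], e)
            cur[3] = max(cur[3], v)
            cur[4] += 1
        else:
            if cur is not None:
                out.append(tuple(cur))
            cur = [key, s, e, v, 1]
    if cur is not None:
        out.append(tuple(cur))
    return out


# Hypothetical rows: (id, start_time, stop_time, some_value)
rows = [(1, 0, 10, 5), (1, 2, 5, 9), (1, 5, 20, 50), (1, 30, 40, 7)]
print(merge_with_count(rows))
# [(1, 0, 20, 50, 3), (1, 30, 40, 7, 1)]
```

Intervals that merely touch (one starts exactly where another stops) are merged here, which matches the grouping rule used by the query.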
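Answer 1 (score: 0)
You can use a cumulative max() to determine when a new group starts, then a cumulative conditional count() to identify the groups, and finally aggregate: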
select id, min(start_time) as start_time, max(stop_time) as stop_time, max(some_value) as some_value
from (select t.*,
             -- a row starts a new group when it begins after every earlier stop_time
             countif(prev_stop_time is null or prev_stop_time < start_time) over (partition by id order by start_time) as grp
      from (select t.*,
                   -- latest stop_time among all earlier rows for the same id
                   max(stop_time) over (partition by id order by start_time rows between unbounded preceding and 1 preceding) as prev_stop_time
            from t
           ) t
     ) t
group by id, grp;
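The running-max idea in this answer can be sketched in Python as follows. The point of tracking the maximum stop_time seen so far, rather than only the previous row's stop_time, is that a long interval can still cover rows that start after several shorter ones have ended. The name `merge_running_max` and the rows are assumptions for illustration only.

```python
from itertools import groupby


def merge_running_max(rows):
    """rows: iterable of (id, start_time, stop_time, some_value) tuples."""
    out = []
    for key, grp in groupby(sorted(rows), key=lambda r: r[0]):
        cur = None  # [start, running max stop_time, max value]
        for _, s, e, v in grp:
            # same test as: prev_stop_time is null or prev_stop_time < start_time
            if cur is None or cur[1] < s:
                if cur is not None:
                    out.append((key, *cur))
                cur = [s, e, v]
            else:
                cur[1] = max(cur[1], e)
                cur[2] = max(cur[2], v)
        if cur is not None:
            out.append((key, *cur))
    return out


# Hypothetical rows: the first one spans the other two
rows = [(1, 0, 100, 1), (1, 10, 20, 9), (1, 50, 60, 3)]
print(merge_running_max(rows))
# [(1, 0, 100, 9)]
```

Comparing each row only with the immediately preceding stop_time (20) would wrongly start a new group at 50; the running maximum (100) keeps all three rows together.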