Merging overlapping time intervals in SQL based on a hierarchy

Time: 2019-05-30 21:19:12

Tags: sql google-bigquery

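I am trying to merge overlapping intervals for a given id column, but I also want the merge to respect a hierarchy/priority. Each interval has a start_time and a stop_time, and each interval has a hierarchy/priority associated with it.

The table has the following columns:

id, start_time, stop_time, priority

I was able to solve the version of the problem that ignores priority, but I am struggling with this one.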

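Red colour: p1 (priority 1)
Blue colour: p2 (priority 2)
Green colour: p3 (priority 3)

Note that in the sample input below there are 9 rows with the same id, and the output has 6 rows. Note also that some ids may have only some of the priority values, or only a single priority value, and the solution should handle that.

Expected input and output:

[image: expected input and output]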

2 Answers:

Answer 0: (score: 2)

Below is for BigQuery Standard SQL

#standardSQL
WITH check_times AS (
  SELECT id, start_time AS time FROM `project.dataset.table` UNION DISTINCT
  SELECT id, stop_time AS time FROM `project.dataset.table` 
), distinct_intervals AS (
  SELECT id, time AS start_time, LEAD(time) OVER(PARTITION BY id ORDER BY time) stop_time
  FROM check_times
), deduped_intervals AS (
  SELECT a.id, a.start_time, a.stop_time, MIN(priority) priority
  FROM distinct_intervals a
  JOIN `project.dataset.table` b
  ON a.id = b.id 
  AND a.start_time BETWEEN b.start_time AND b.stop_time 
  AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), combined_intervals AS (
  SELECT id, MIN(start_time) start_time, MAX(stop_time) stop_time, ANY_VALUE(priority) priority 
  FROM (
    SELECT id, start_time, stop_time, priority, COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) grp
    FROM (
      SELECT id, start_time, stop_time, priority, 
        start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) OR
        priority != IFNULL(LAG(priority) OVER(PARTITION BY id ORDER BY start_time), -1) flag
      FROM deduped_intervals
    )
  )
  GROUP BY id, grp
)
SELECT *
FROM combined_intervals
-- ORDER BY id, start_time
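
For readers who want to check the logic without BigQuery, the four steps of the query above (check_times, distinct_intervals, deduped_intervals, combined_intervals) can be sketched in plain Python. This is only an illustration, not part of the answer: the function name `merge_by_priority` and the sample rows are hypothetical, and it assumes, as the `MIN(priority)` in the query does, that a lower priority number wins.

```python
from itertools import groupby


def merge_by_priority(rows):
    """rows: iterable of (id, start_time, stop_time, priority) tuples.

    Assumption: a lower priority number outranks a higher one.
    """
    result = []
    for id_, group in groupby(sorted(rows), key=lambda r: r[0]):
        group = list(group)
        # check_times: every distinct start or stop time for this id
        cuts = sorted({t for _, s, e, _ in group for t in (s, e)})
        merged = []  # entries are [start, stop, priority]
        # distinct_intervals: consecutive cut points form elementary slices
        for lo, hi in zip(cuts, cuts[1:]):
            # deduped_intervals: lowest priority among rows covering the slice
            covering = [p for _, s, e, p in group if s <= lo and hi <= e]
            if not covering:
                continue  # a gap: no source row covers this slice
            prio = min(covering)
            # combined_intervals: extend the previous slice when it is
            # contiguous and has the same priority, else start a new one
            if merged and merged[-1][1] == lo and merged[-1][2] == prio:
                merged[-1][1] = hi
            else:
                merged.append([lo, hi, prio])
        result.extend((id_, lo, hi, p) for lo, hi, p in merged)
    return result


# Hypothetical sample rows, not the data from the question
rows = [(1, 0, 10, 3), (1, 5, 15, 1), (1, 12, 20, 2), (1, 30, 40, 2)]
for row in merge_by_priority(rows):
    print(row)
```

On these rows the priority-1 interval cuts into the priority-3 and priority-2 intervals it overlaps, and the gap between 20 and 30 keeps the last interval separate.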

Applied to the sample data from your question, the result is:

[image: query result for the sample data]

  

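Could you also share a solution that merges the intervals based on id alone, without the priority column?

I just slightly adjusted the query so that it ignores priority: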

#standardSQL
WITH check_times AS (
  SELECT id, start_time AS time FROM `project.dataset.table` UNION DISTINCT
  SELECT id, stop_time AS time FROM `project.dataset.table`
), distinct_intervals AS (
  SELECT id, time AS start_time, LEAD(time) OVER(PARTITION BY id ORDER BY time) stop_time
  FROM check_times
), deduped_intervals AS (
  SELECT a.id, a.start_time, a.stop_time 
  FROM distinct_intervals a
  JOIN `project.dataset.table` b
  ON a.id = b.id 
  AND a.start_time BETWEEN b.start_time AND b.stop_time 
  AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), combined_intervals AS (
  SELECT id, MIN(start_time) start_time, MAX(stop_time) stop_time 
  FROM (
    SELECT id, start_time, stop_time, COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) grp
    FROM (
      SELECT id, start_time, stop_time, 
        start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) flag
      FROM deduped_intervals
    )
  )
  GROUP BY id, grp
)
SELECT *
FROM combined_intervals
-- ORDER BY id, start_time   

with the result:

Row id  start_time  stop_time    
1   1   0           36   
2   1   41          47   

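Answer 1: (score: -1)

This is a type of gaps-and-islands problem. One solution is to find where each island starts and take a cumulative sum of those starts. You can identify a start by looking for rows that do not overlap anything before them: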

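select id, priority, min(start_time) as start_time, max(stop_time) as stop_time
from (select t.*,
             -- a row starts a new island when every earlier row has already stopped before it starts
             countif(coalesce(prev_stop_time, start_time) < start_time) over (partition by id, priority order by start_time) as grp
      from (select t.*,
                   -- latest stop_time among all earlier rows in the same (id, priority) partition
                   max(stop_time) over (partition by id, priority order by start_time rows between unbounded preceding and 1 preceding) as prev_stop_time
            from t
           ) t
      ) t
group by id, priority, grp;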
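
The island-start idea can be traced in plain Python as well: within each (id, priority) partition, a row opens a new island when the running maximum of the earlier stop times is strictly less than its own start time, and the running count of such starts is the group key. This is a minimal sketch rather than the answer's own code; the function name `merge_islands` and the sample rows are hypothetical. Like the query, it merges each priority separately and does not let one priority override another.

```python
from collections import defaultdict


def merge_islands(rows):
    """rows: iterable of (id, priority, start_time, stop_time) tuples."""
    partitions = defaultdict(list)
    for id_, priority, start, stop in rows:
        partitions[(id_, priority)].append((start, stop))

    islands = {}  # (id, priority, grp) -> [first start, latest stop]
    for (id_, priority), spans in sorted(partitions.items()):
        spans.sort()      # order by start_time within the partition
        prev_stop = None  # running max of stop_time over preceding rows
        grp = 0           # cumulative count of island starts
        for start, stop in spans:
            if prev_stop is not None and prev_stop < start:
                grp += 1  # nothing earlier reaches this row: new island
            key = (id_, priority, grp)
            if key in islands:
                islands[key][1] = max(islands[key][1], stop)
            else:
                islands[key] = [start, stop]
            prev_stop = stop if prev_stop is None else max(prev_stop, stop)
    return [(i, p, lo, hi) for (i, p, _), (lo, hi) in islands.items()]


# Hypothetical sample rows, not the data from the question
rows = [(1, 1, 0, 10), (1, 1, 5, 8), (1, 1, 9, 12), (1, 1, 20, 25), (1, 2, 3, 4)]
for row in merge_islands(rows):
    print(row)
```

The row (5, 8) sits entirely inside (0, 10), which is why the comparison uses the running maximum of stop times rather than only the previous row's stop time: (9, 12) still overlaps the island even though it starts after 8.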