I want to count how many overlapping intervals I have per ID.
WITH table AS (
SELECT 1001 as id, 1 AS start_time, 10 AS end_time UNION ALL
SELECT 1001, 2, 5 UNION ALL
SELECT 1002, 3, 4 UNION ALL
SELECT 1003, 5, 8 UNION ALL
SELECT 1003, 6, 8 UNION ALL
SELECT 1001, 6, 20
)
In this case, the expected result is:
2 overlapping for ID=1001
1 overlapping for ID=1003
0 overlapping for ID=1002
TOT OVERLAPPING = 3
Any overlap, even a partial one, should be counted.
How can I do this in BigQuery?
Answer 0: (score: 2)
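The following is for BigQuery Standard SQL. It is quite straightforward: self-join the table on id, check each pair of rows for overlap, and count the overlapping pairs.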
#standardSQL
SELECT a.id,
  -- Count pairs where either interval has an endpoint inside the other
  COUNTIF(
    a.start_time BETWEEN b.start_time AND b.end_time
    OR a.end_time BETWEEN b.start_time AND b.end_time
    OR b.start_time BETWEEN a.start_time AND a.end_time
    OR b.end_time BETWEEN a.start_time AND a.end_time
  ) overlaps
FROM `project.dataset.table` a
-- The string comparison keeps each pair once and never pairs a row with itself
LEFT JOIN `project.dataset.table` b
  ON a.id = b.id AND TO_JSON_STRING(a) < TO_JSON_STRING(b)
GROUP BY a.id
Applied to the sample data from the question, the result is:
Row  id    overlaps
1    1001  2
2    1002  0
3    1003  1
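As a cross-check outside BigQuery, here is a minimal Python sketch of the same pairwise count on the question's sample rows. The names `ROWS`, `overlaps_between`, `overlaps_short`, and `count_overlaps` are made up for this illustration. It also suggests that the four BETWEEN tests can be reduced to two comparisons, assuming every row has start_time <= end_time.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows from the question: (id, start_time, end_time).
ROWS = [
    (1001, 1, 10), (1001, 2, 5), (1002, 3, 4),
    (1003, 5, 8), (1003, 6, 8), (1001, 6, 20),
]

def overlaps_between(a, b):
    """The four BETWEEN tests from the query, with inclusive bounds."""
    (_, a_start, a_end), (_, b_start, b_end) = a, b
    return (b_start <= a_start <= b_end or b_start <= a_end <= b_end
            or a_start <= b_start <= a_end or a_start <= b_end <= a_end)

def overlaps_short(a, b):
    """Two-comparison form; assumes start_time <= end_time in every row."""
    return a[1] <= b[2] and b[1] <= a[2]

def count_overlaps(rows, predicate):
    """Count overlapping unordered pairs per id."""
    counts = {row[0]: 0 for row in rows}  # ids without overlaps stay at 0
    by_id = defaultdict(list)
    for row in rows:
        by_id[row[0]].append(row)
    for row_id, group in by_id.items():
        # combinations() visits each unordered pair once, like the "<" join condition.
        counts[row_id] = sum(predicate(a, b) for a, b in combinations(group, 2))
    return counts

print(count_overlaps(ROWS, overlaps_between))  # {1001: 2, 1002: 0, 1003: 1}
print(count_overlaps(ROWS, overlaps_short))    # {1001: 2, 1002: 0, 1003: 1}
```

One difference from the query: two rows that are exact duplicates produce equal TO_JSON_STRING values, so the join's `<` comparison would not pair them, whereas this sketch would count them as one overlapping pair.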
Another option (it avoids the self-join and uses an analytic function instead):
#standardSQL
SELECT id,
  SUM((SELECT COUNT(1) FROM y.arr x
    WHERE y.start_time BETWEEN x.start_time AND x.end_time
    OR y.end_time BETWEEN x.start_time AND x.end_time
    OR x.start_time BETWEEN y.start_time AND y.end_time
    OR x.end_time BETWEEN y.start_time AND y.end_time
  )) overlaps
FROM (
  SELECT id, start_time, end_time,
    ARRAY_AGG(STRUCT(start_time, end_time))
      OVER(PARTITION BY id ORDER BY TO_JSON_STRING(t)
        ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
      ) arr
  FROM `project.dataset.table` t
) y
GROUP BY id
This produces the same result as the previous version.
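To make the window frame concrete, here is a rough Python sketch of the same idea, assuming the question's sample rows: within each id, every row is compared only with the rows that follow it in a fixed order. The names `ROWS` and `following_rows_counts` are hypothetical. The sketch also sums the per-id counts to get the total the question asks for.

```python
from collections import defaultdict

# Sample rows from the question: (id, start_time, end_time).
ROWS = [
    (1001, 1, 10), (1001, 2, 5), (1002, 3, 4),
    (1003, 5, 8), (1003, 6, 8), (1001, 6, 20),
]

def following_rows_counts(rows):
    """For each id, compare every row with the rows after it in sort order."""
    by_id = defaultdict(list)
    for row_id, start, end in rows:
        by_id[row_id].append((start, end))
    counts = {}
    for row_id, intervals in by_id.items():
        # Any deterministic order works here; the query orders by TO_JSON_STRING(t).
        intervals.sort()
        total = 0
        for i, (start, end) in enumerate(intervals):
            # Mirrors ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING.
            arr = intervals[i + 1:]
            # Inclusive overlap test, assuming start_time <= end_time.
            total += sum(1 for s, e in arr if start <= e and s <= end)
        counts[row_id] = total
    return counts

per_id = following_rows_counts(ROWS)
grand_total = sum(per_id.values())
print(per_id)       # {1001: 2, 1002: 0, 1003: 1}
print(grand_total)  # 3
```

Because each row only looks forward, no pair is visited twice and no row is compared with itself, which is why no extra deduplication step is needed.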
Answer 1: (score: 0)
The general logic for overlaps compares the start and end times of the two intervals:
SELECT t1.id,
       COUNTIF(t1.start_time < t2.end_time AND t2.start_time < t1.end_time) as num_overlaps
FROM `project.dataset.table` t1 LEFT JOIN
     `project.dataset.table` t2
     ON t1.id = t2.id
GROUP BY t1.id;
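This is not quite what you want, because it compares every interval with every other interval, including itself. Removing the "same" rows basically requires a unique identifier, which we can get using row_number().
Also, you presumably don't want to count each pair twice. So: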
with t as (
      select t.*, row_number() over (partition by id order by start_time) as seqnum
      from `project.dataset.table` t
     )
SELECT t1.id,
       COUNTIF(t1.start_time < t2.end_time AND t2.start_time < t1.end_time) as num_overlaps
FROM t t1 LEFT JOIN
     t t2
     ON t1.id = t2.id AND t1.seqnum < t2.seqnum
GROUP BY t1.id;
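To see why the sequence number matters, here is a small Python sketch, assuming the question's sample rows and the strict overlap test `t1.start < t2.end and t2.start < t1.end`. The names `ROWS`, `strict_overlap`, and `join_counts` are invented for this illustration. It contrasts a join over all ordered pairs (including a row with itself) with a join restricted to `seqnum1 < seqnum2`.

```python
from collections import defaultdict

# Sample rows from the question: (id, start_time, end_time).
ROWS = [
    (1001, 1, 10), (1001, 2, 5), (1002, 3, 4),
    (1003, 5, 8), (1003, 6, 8), (1001, 6, 20),
]

def strict_overlap(t1, t2):
    """Strict test: intervals that only touch at an endpoint do not count."""
    return t1[0] < t2[1] and t2[0] < t1[1]

def join_counts(rows, dedupe):
    """Count overlapping joined pairs per id, with or without the seqnum filter."""
    by_id = defaultdict(list)
    for row_id, start, end in rows:
        by_id[row_id].append((start, end))
    counts = {}
    for row_id, intervals in by_id.items():
        # Position after sorting plays the role of row_number() ordered by start_time.
        intervals.sort(key=lambda interval: interval[0])
        n = 0
        for seq1, t1 in enumerate(intervals):
            for seq2, t2 in enumerate(intervals):
                if dedupe and not seq1 < seq2:
                    continue  # skip self-pairs and the mirrored copy of each pair
                n += strict_overlap(t1, t2)
        counts[row_id] = n
    return counts

print(join_counts(ROWS, dedupe=False))  # {1001: 7, 1002: 1, 1003: 4}
print(join_counts(ROWS, dedupe=True))   # {1001: 2, 1002: 0, 1003: 1}
```

Without the filter, each row matches itself and every overlapping pair is counted in both directions, so id 1001 yields 3 + 2 * 2 = 7 instead of 2. Note also that the strict `<` comparison treats intervals that merely touch at an endpoint as non-overlapping, unlike the inclusive BETWEEN tests in the other answer; the sample data contains no such case.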