Count overlapping intervals by ID in BigQuery

Time: 2019-01-30 22:05:12

Tags: sql google-bigquery

I want to count how many overlapping intervals I have per ID.

WITH table AS (
  SELECT 1001 as id, 1 AS start_time, 10 AS end_time UNION ALL
  SELECT 1001, 2, 5 UNION ALL
  SELECT 1002, 3, 4 UNION ALL
  SELECT 1003, 5, 8 UNION ALL
  SELECT 1003, 6, 8 UNION ALL
  SELECT 1001, 6, 20 
)

In this case, the expected result would be:

2 overlapping for ID=1001
1 overlapping for ID=1003
0 overlapping for ID=1002
TOT OVERLAPPING = 3

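Whenever there is an overlap (even a partial one), I need to count it.

How can I achieve this in BigQuery?

2 Answers:

Answer 0 (score: 2)

The following is for BigQuery Standard SQL. It is fairly straightforward: self-join the table on id, then check for and count the overlaps.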

#standardSQL
SELECT a.id, 
  COUNTIF(
    a.start_time BETWEEN b.start_time AND b.end_time
    OR a.end_time BETWEEN b.start_time AND b.end_time
    OR b.start_time BETWEEN a.start_time AND a.end_time
    OR b.end_time BETWEEN a.start_time AND a.end_time
  ) overlaps
FROM `project.dataset.table` a
LEFT JOIN `project.dataset.table` b
ON a.id = b.id AND TO_JSON_STRING(a) < TO_JSON_STRING(b)
GROUP BY id

Applied to the sample data from the question, the result is:

Row id      overlaps     
1   1001    2    
2   1002    0    
3   1003    1     
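
As a sanity check outside BigQuery, the pairwise comparison can be brute-forced on the sample rows from the question. This is only a sketch: the helper name `count_overlaps` is hypothetical, endpoints are treated as inclusive to mirror `BETWEEN`, and every interval is assumed to satisfy start_time <= end_time.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows from the question: (id, start_time, end_time)
ROWS = [
    (1001, 1, 10),
    (1001, 2, 5),
    (1002, 3, 4),
    (1003, 5, 8),
    (1003, 6, 8),
    (1001, 6, 20),
]


def count_overlaps(rows):
    """Count overlapping interval pairs per id (inclusive endpoints, like BETWEEN)."""
    by_id = defaultdict(list)
    for row_id, start, end in rows:
        by_id[row_id].append((start, end))

    counts = {}
    for row_id, intervals in by_id.items():
        # combinations() yields each unordered pair once, so no pair is counted twice.
        counts[row_id] = sum(
            1
            for (s1, e1), (s2, e2) in combinations(intervals, 2)
            if s1 <= e2 and s2 <= e1
        )
    return counts


if __name__ == "__main__":
    result = count_overlaps(ROWS)
    for row_id in sorted(result):
        print(row_id, result[row_id])  # 1001 2, then 1002 0, then 1003 1
    print("total", sum(result.values()))  # total 3
```

The counts match the table above, and their sum is the TOT OVERLAPPING = 3 from the question.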

Another option (using analytic functions instead of a self-join):

#standardSQL
SELECT id,
  SUM((SELECT COUNT(1) FROM y.arr x
    WHERE y.start_time BETWEEN x.start_time AND x.end_time
    OR y.end_time BETWEEN x.start_time AND x.end_time
    OR x.start_time BETWEEN y.start_time AND y.end_time
    OR x.end_time BETWEEN y.start_time AND y.end_time
  )) overlaps     
FROM (
  SELECT id, start_time, end_time,
    ARRAY_AGG(STRUCT(start_time, end_time)) 
      OVER(PARTITION BY id ORDER BY TO_JSON_STRING(t) 
        ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
      ) arr
  FROM `project.dataset.table` t
) y
GROUP BY id

This obviously produces the same result as the previous version.
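
The window frame `ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING` gives each row an array of the rows that come after it within its id partition, so every pair is examined exactly once. A rough Python emulation of that idea is sketched below. The helpers `following_frames` and `overlaps_per_id` are hypothetical names, and a plain tuple sort stands in for `ORDER BY TO_JSON_STRING(t)`, on the assumption that any deterministic order works for counting pairs.

```python
from itertools import groupby

# Sample rows from the question: (id, start_time, end_time)
ROWS = [
    (1001, 1, 10),
    (1001, 2, 5),
    (1002, 3, 4),
    (1003, 5, 8),
    (1003, 6, 8),
    (1001, 6, 20),
]


def following_frames(rows):
    """Yield (row, rows after it in the same id partition), like the arr column."""
    ordered = sorted(rows)  # tuples sort by id first, so partitions are contiguous
    for _, group in groupby(ordered, key=lambda r: r[0]):
        partition = list(group)
        for i, row in enumerate(partition):
            yield row, partition[i + 1:]


def overlaps_per_id(rows):
    """Sum, per id, the overlaps between each row and the rows in its frame."""
    totals = {}
    for (row_id, start, end), frame in following_frames(rows):
        # Inclusive comparison, equivalent to the BETWEEN checks when start <= end.
        hits = sum(1 for _, s, e in frame if start <= e and s <= end)
        totals[row_id] = totals.get(row_id, 0) + hits
    return totals


if __name__ == "__main__":
    print(overlaps_per_id(ROWS))  # {1001: 2, 1002: 0, 1003: 1}
```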

Answer 1 (score: 0)

The logic for any overlap compares the start and end times of the two intervals:

SELECT t1.id, 
       COUNTIF(t1.end_time > t2.start_time AND t1.start_time < t2.end_time) as num_overlaps
FROM `project.dataset.table` t1 LEFT JOIN
     `project.dataset.table` t2
     ON t1.id = t2.id 
GROUP BY t1.id;

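This is not quite what you want, because it compares each interval with every other interval, including itself. Removing the "same" rows basically requires a unique identifier, which we can get with row_number().

Also, you don't seem to want to count each pair twice. So: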

with t as (
      select t.*, row_number() over (partition by id order by start_time) as seqnum
      from `project.dataset.table` t
     )
SELECT t1.id, 
       COUNTIF(t1.end_time > t2.start_time AND t1.start_time < t2.end_time) as num_overlaps
FROM t t1 LEFT JOIN
     t t2
     ON t1.id = t2.id AND t1.seqnum < t2.seqnum
GROUP BY t1.id;
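
To see why the unique identifier and the `t1.seqnum < t2.seqnum` condition matter, here is a small Python sketch on the sample rows from the question. The names `strictly_overlap` and `count_pairs` are hypothetical. The list index plays the role of the row number, which is an assumption made for illustration: any unique ordering of the rows serves the same purpose.

```python
# Sample rows from the question: (id, start_time, end_time)
ROWS = [
    (1001, 1, 10),
    (1001, 2, 5),
    (1002, 3, 4),
    (1003, 5, 8),
    (1003, 6, 8),
    (1001, 6, 20),
]


def strictly_overlap(a, b):
    """Two-sided test: each interval must start before the other one ends."""
    return a[1] < b[2] and b[1] < a[2]


def count_pairs(rows, pair_filter):
    """Count overlapping (a, b) pairs per id among the pairs allowed by pair_filter."""
    counts = {row[0]: 0 for row in rows}
    for i, a in enumerate(rows):
        for j, b in enumerate(rows):
            if a[0] == b[0] and pair_filter(i, j) and strictly_overlap(a, b):
                counts[a[0]] += 1
    return counts


# Plain self-join: every row meets itself, and every pair appears in both orders.
all_pairs = count_pairs(ROWS, lambda i, j: True)
# Keeping only i < j drops self-matches and mirrored pairs, like seqnum < seqnum.
unique_pairs = count_pairs(ROWS, lambda i, j: i < j)

if __name__ == "__main__":
    print(all_pairs)     # {1001: 7, 1002: 1, 1003: 4}
    print(unique_pairs)  # {1001: 2, 1002: 0, 1003: 1}
```

For each id, the unfiltered count equals the number of rows (self-matches) plus twice the number of genuinely overlapping pairs, which is why the filtered version is the one that matches the expected result.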