Count of overlapping intervals in BigQuery

Date: 2017-04-10 15:28:42

Tags: sql google-bigquery

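Given a table of intervals, can I efficiently query, at the start of each interval, the number of intervals that are currently open (including the current interval itself)?

For example, given the following table: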

start_time end_time
         1       10
         2        5
         3        4
         5        6
         7       11
        19       20

I would like the following output:

start_time count
         1     1
         2     2
         3     3
         5     3
         7     2
        19     1

On a small dataset, I can solve this by joining the dataset against itself:

WITH intervals AS (
  SELECT 1 AS start_time, 10 AS end_time UNION ALL
  SELECT 2, 5 UNION ALL
  SELECT 3, 4 UNION ALL
  SELECT 5, 6 UNION ALL
  SELECT 7, 11 UNION ALL
  SELECT 19, 20
)
SELECT a.start_time, COUNT(*)
FROM intervals a CROSS JOIN intervals b
WHERE a.start_time >= b.start_time AND a.start_time <= b.end_time
GROUP BY a.start_time
ORDER BY a.start_time

With a large dataset, the CROSS JOIN is both impractical and unnecessary, since any given answer depends only on a small number of preceding intervals (when sorted by start_time). In fact, on the dataset I have, the query times out. Is there a better way to achieve this?

1 answer:

Answer 0 (score: 2)

  

... the CROSS JOIN is both impractical and unnecessary ... Is there a better way to achieve this?

Try the BigQuery Standard SQL below. No JOIN is involved.

  
#standardSQL
SELECT 
  start_time,
  (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt  
FROM (
  SELECT 
    start_time, 
    ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
  FROM intervals
)
-- ORDER BY start_time  

You can test and play with it using the dummy data from your question:

#standardSQL
WITH intervals AS (
  SELECT 1 AS start_time, 10 AS end_time UNION ALL
  SELECT 2, 5 UNION ALL
  SELECT 3, 4 UNION ALL
  SELECT 5, 6 UNION ALL
  SELECT 7, 11 UNION ALL
  SELECT 19, 20 
)
SELECT 
  start_time,
  (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt  
FROM (
  SELECT 
    start_time, 
    ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
  FROM intervals
)
-- ORDER BY start_time
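
The query above collects, for each row, the end times of every interval that started at or before it, so the arrays grow with the table. One possible alternative, which is not from the answer above, is to treat every start and every end as an event and keep a running sum. The sketch below assumes that end_time >= start_time for every interval; the names events, t, is_end, delta, and open_count are hypothetical and used only for illustration.

```sql
#standardSQL
WITH intervals AS (
  SELECT 1 AS start_time, 10 AS end_time UNION ALL
  SELECT 2, 5 UNION ALL
  SELECT 3, 4 UNION ALL
  SELECT 5, 6 UNION ALL
  SELECT 7, 11 UNION ALL
  SELECT 19, 20
),
events AS (
  -- Each interval contributes +1 at its start and -1 at its end.
  -- is_end = 0 sorts start events before end events at the same time,
  -- so an interval ending at t still counts as open for one starting at t.
  SELECT start_time AS t, 0 AS is_end, 1 AS delta FROM intervals
  UNION ALL
  SELECT end_time, 1, -1 FROM intervals
)
SELECT
  t AS start_time,
  open_count AS cnt
FROM (
  SELECT
    t,
    is_end,
    -- Running sum: starts at or before t minus ends strictly before t.
    SUM(delta) OVER(ORDER BY t, is_end) AS open_count
  FROM events
)
WHERE is_end = 0  -- keep only the rows that correspond to interval starts
ORDER BY start_time
```

On the dummy data this should reproduce the counts requested in the question (1, 2, 3, 3, 2, 1). The window still has no PARTITION BY, so whether it scales better on a given dataset is something to measure rather than assume.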