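Given a table of intervals, can I efficiently query the number of intervals that are open at the start of each interval (including the current interval itself)?
For example, given the following table:
start_time  end_time
1           10
2           5
3           4
5           6
7           11
19          20
I want the following output:
start_time  count
1           1
2           2
3           3
5           3
7           2
19          1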
On a small dataset, I can solve this by joining the dataset with itself:
WITH intervals AS (
  SELECT 1 AS start_time, 10 AS end_time UNION ALL
  SELECT 2, 5 UNION ALL
  SELECT 3, 4 UNION ALL
  SELECT 5, 6 UNION ALL
  SELECT 7, 11 UNION ALL
  SELECT 19, 20
)
SELECT
  a.start_time,
  count(*)
FROM
  intervals a CROSS JOIN intervals b
WHERE
  a.start_time >= b.start_time AND
  a.start_time <= b.end_time
GROUP BY a.start_time
ORDER BY a.start_time
For large datasets, the CROSS JOIN is neither practical nor necessary, because any given answer depends only on a small number of preceding intervals (when sorted by start_time). In fact, on the dataset I have, it times out. Is there a better way to achieve this?
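Answer 0 (score: 2)
... the CROSS JOIN is neither practical nor necessary ... Is there a better way to achieve this?
Try the BigQuery Standard SQL below. No JOIN is involved: a window function collects the end_time of every interval that starts at or before the current row, and a subquery counts how many of those have not yet ended.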
#standardSQL
SELECT
  start_time,
  (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt
FROM (
  SELECT
    start_time,
    ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
  FROM intervals
)
-- ORDER BY start_time
You can test and play with it using the dummy data from your question, as in the example below:
#standardSQL
WITH intervals AS (
  SELECT 1 AS start_time, 10 AS end_time UNION ALL
  SELECT 2, 5 UNION ALL
  SELECT 3, 4 UNION ALL
  SELECT 5, 6 UNION ALL
  SELECT 7, 11 UNION ALL
  SELECT 19, 20
)
SELECT
  start_time,
  (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt
FROM (
  SELECT
    start_time,
    ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
  FROM intervals
)
-- ORDER BY start_time
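If you want to sanity-check the counts outside BigQuery, a single sweep over the rows in start_time order should give the same result. The Python sketch below is only an illustration and not a translation of the query: it swaps the array aggregation for a min-heap of end times. It assumes start_time values are distinct and that both interval bounds are inclusive, as in the question. The function name `open_counts` is a made-up helper.

```python
import heapq


def open_counts(intervals):
    """Return (start_time, count) pairs: how many intervals are open at each start.

    Assumptions (not from the original post): start_time values are distinct,
    and an interval counts as open at time t when start_time <= t <= end_time.
    """
    result = []
    open_ends = []  # min-heap of end_time values for intervals still open
    for start, end in sorted(intervals):
        # Drop every earlier interval that ended strictly before this start.
        while open_ends and open_ends[0] < start:
            heapq.heappop(open_ends)
        # The current interval is open at its own start, so it is counted too.
        heapq.heappush(open_ends, end)
        result.append((start, len(open_ends)))
    return result


# Dummy data from the question.
intervals = [(1, 10), (2, 5), (3, 4), (5, 6), (7, 11), (19, 20)]
print(open_counts(intervals))
```

The heap holds only the intervals that are still open, which matches the observation in the question that each answer depends on a small number of preceding intervals.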