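Given rows that each represent a task with a start time and an end time, how can I use a window function such as COUNT OVER to calculate, at the moment each task starts, the number of running tasks (started and not yet ended, including the task itself)? Is a window function even the right approach?
Example, given the table tasks: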
task_id start_time end_time
a 1 10
b 2 5
c 5 15
d 8 13
e 12 20
f 21 30
Calculate running_tasks:
task_id start_time end_time running_tasks
a 1 10 1 # a
b 2 5 2 # a,b
c 5 15 2 # a,c (b has ended)
d 8 13 3 # a,c,d
e 12 20 3 # c,d,e (a has ended)
f 21 30 1 # f (c,d,e have ended)
Answer 0 (score: 2)
select task_id, start_time, end_time, running_tasks
from (select task_id, tm, op, start_time, end_time
            ,sum(op) over
             (
                 order by tm, op
                 rows unbounded preceding
             ) as running_tasks
      from (select task_id, start_time as tm, 1 as op, start_time, end_time
            from tasks
            union all
            select task_id, end_time as tm, -1 as op, start_time, end_time
            from tasks
           ) t
     ) t
where op = 1
;
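The query above turns every task into two events, +1 at start_time and -1 at end_time, and keeps a running sum over them ordered by (tm, op). The following is a minimal Python sketch of that idea, assuming the sample rows from the question; the function name running_tasks_rows is hypothetical and only mirrors the row-by-row (ROWS) running sum, not any particular engine's implementation.

```python
# Sample rows from the question: (task_id, start_time, end_time).
TASKS = [
    ("a", 1, 10), ("b", 2, 5), ("c", 5, 15),
    ("d", 8, 13), ("e", 12, 20), ("f", 21, 30),
]


def running_tasks_rows(tasks):
    """Mimic SUM(op) OVER (ORDER BY tm, op ROWS UNBOUNDED PRECEDING)."""
    # One +1 event per start and one -1 event per end, like the UNION ALL.
    events = [(start, 1, task_id) for task_id, start, _ in tasks]
    events += [(end, -1, task_id) for task_id, _, end in tasks]
    # Sort by (tm, op) so an end at time t is applied before a start at t.
    events.sort(key=lambda ev: (ev[0], ev[1]))

    result = {}
    total = 0
    for _tm, op, task_id in events:
        total += op              # row-by-row running sum
        if op == 1:              # keep only start events (WHERE op = 1)
            result[task_id] = total
    return result


print(running_tasks_rows(TASKS))
```

On the sample data this yields 1, 2, 2, 3, 3, 1 for tasks a to f, matching the expected running_tasks column in the question.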
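Answer 1 (score: 2)
You can use a correlated subquery, which in this case amounts to a self-join; no analytic functions are needed. After enabling standard SQL in the UI under "Show Options" (uncheck "Use Legacy SQL"), you can run this example: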
WITH tasks AS (
  SELECT
    task_id,
    start_time,
    end_time
  FROM UNNEST(ARRAY<STRUCT<task_id STRING, start_time INT64, end_time INT64>>[
    ('a', 1, 10),
    ('b', 2, 5),
    ('c', 5, 15),
    ('d', 8, 13),
    ('e', 12, 20),
    ('f', 21, 30)
  ])
)
SELECT
  *,
  (SELECT COUNT(*) FROM tasks t2
   WHERE t.start_time >= t2.start_time AND
         t.start_time < t2.end_time) AS running_tasks
FROM tasks t
ORDER BY task_id;
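The correlated subquery counts, for each task t, every task t2 with t2.start_time <= t.start_time < t2.end_time. A brute-force Python sketch of the same predicate can serve as a cross-check for any other solution; the function name running_tasks_bruteforce is hypothetical, and the loop is quadratic in the number of tasks.

```python
# Sample rows from the question: (task_id, start_time, end_time).
TASKS = [
    ("a", 1, 10), ("b", 2, 5), ("c", 5, 15),
    ("d", 8, 13), ("e", 12, 20), ("f", 21, 30),
]


def running_tasks_bruteforce(tasks):
    """For each task, count the tasks that started at or before its start
    and have not yet ended (end_time is treated as exclusive)."""
    return {
        task_id: sum(1 for _, s2, e2 in tasks if s2 <= start < e2)
        for task_id, start, _ in tasks
    }


print(running_tasks_bruteforce(TASKS))
```

Because each task is compared against all others directly, tasks that share the same start_time all count one another, with no dependence on row ordering.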
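Answer 2 (score: 2)
As Elliott said, "analytic functions are usually more difficult to explain to new users", and even established users are not always 100% good at them (although very close to it)! So, while Dudu Markovitz's answer is great, unfortunately it is still incorrect (at least as I understand the question). It fails when multiple tasks start at the same start_time: those tasks get the wrong "running tasks" result.
As an example, consider the following data: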
task_id start_time end_time
a 1 10
aa 1 2
aaa 1 8
b 2 5
c 5 15
d 8 13
e 12 20
f 21 30
I think you would expect the following result:
task_id start_time end_time running_tasks
a 1 10 3 # a,aa,aaa
aa 1 2 3 # a,aa,aaa
aaa 1 8 3 # a,aa,aaa
b 2 5 3 # a,aaa,b (aa has ended)
c 5 15 3 # a,aaa,c (b has ended)
d 8 13 3 # a,c,d (aaa has ended)
e 12 20 3 # c,d,e (a has ended)
f 21 30 1 # f (c,d,e have ended)
If you try Dudu's code, you will get the following instead:
task_id start_time end_time running_tasks
a 1 10 1
aa 1 2 2
aaa 1 8 3
b 2 5 3
c 5 15 3
d 8 13 3
e 12 20 3
f 21 30 1
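As you can see, the results for tasks a and aa are wrong (1 and 2 instead of 3).
The reason is the use of ROWS UNBOUNDED PRECEDING instead of RANGE UNBOUNDED PRECEDING: a small but very important nuance! With ROWS, each of the three rows tied at start_time 1 only sees the rows sorted before it; with RANGE, tied rows are peers and share the same frame.
So the query below will give you the correct result: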
SELECT task_id, start_time, end_time, running_tasks
FROM (
  SELECT
    task_id, tm, op, start_time, end_time,
    SUM(op) OVER (ORDER BY tm, op RANGE UNBOUNDED PRECEDING) AS running_tasks
  FROM (
    SELECT
      task_id, start_time AS tm, 1 AS op, start_time, end_time
    FROM tasks UNION ALL
    SELECT
      task_id, end_time AS tm, -1 AS op, start_time, end_time
    FROM tasks
  ) t
) t
WHERE op = 1
ORDER BY start_time
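Quick summary:
ROWS UNBOUNDED PRECEDING sets the window frame based on row position,
while
RANGE UNBOUNDED PRECEDING sets the window frame based on row values, so rows with equal ORDER BY values fall into the same frame.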
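To make the difference concrete, here is a small Python sketch that applies both framing rules to the tied example. It is only an illustration under stated assumptions: the helper names sorted_events, cumulative_rows and cumulative_range are hypothetical, and ties within the same (tm, op) are kept in input order, whereas a SQL engine may order them arbitrarily.

```python
from itertools import groupby

# Tied example: a, aa and aaa all start at time 1.
TIED_TASKS = [
    ("a", 1, 10), ("aa", 1, 2), ("aaa", 1, 8), ("b", 2, 5),
    ("c", 5, 15), ("d", 8, 13), ("e", 12, 20), ("f", 21, 30),
]


def sorted_events(tasks):
    """+1 event per start, -1 event per end, ordered by (tm, op)."""
    events = [(s, 1, t) for t, s, _ in tasks]
    events += [(e, -1, t) for t, _, e in tasks]
    return sorted(events, key=lambda ev: (ev[0], ev[1]))


def cumulative_rows(tasks):
    """ROWS frame: each row sees only the rows positioned before it."""
    total, out = 0, {}
    for _tm, op, task_id in sorted_events(tasks):
        total += op
        if op == 1:
            out[task_id] = total
    return out


def cumulative_range(tasks):
    """RANGE frame: rows with equal (tm, op) are peers and share one total."""
    total, out = 0, {}
    by_key = groupby(sorted_events(tasks), key=lambda ev: (ev[0], ev[1]))
    for (_tm, op), peers in by_key:
        peers = list(peers)
        total += op * len(peers)     # add the whole peer group at once
        if op == 1:
            for _, _, task_id in peers:
                out[task_id] = total
    return out


print(cumulative_rows(TIED_TASKS))   # a, aa, aaa get 1, 2, 3
print(cumulative_range(TIED_TASKS))  # a, aa, aaa all get 3
```

The two functions differ only in whether tied rows are accumulated one at a time or as a group, which is the distinction between the two frame clauses.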
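Again, as Elliott mentioned, fully getting into this is more complicated than the join concept, but it is worth it (because it is more efficient than a join). See more about the Window Frame Clause and ROWS vs RANGE usage.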