COUNT() OVER conditioned on CURRENT ROW

Time: 2016-11-20 19:39:59

Tags: sql google-bigquery window-functions

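Given a table in which each row represents a task with a start time and an end time, how can I use a window function (COUNT OVER) to calculate, for each task, the number of tasks running at the moment it starts (including itself), i.e. tasks that have started and not yet ended? Are window functions even the right approach?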

Example: given the table tasks

task_id  start_time  end_time
   a         1          10
   b         2           5
   c         5          15
   d         8          13
   e        12          20
   f        21          30

compute running_tasks

task_id  start_time  end_time  running_tasks
   a         1          10           1         # a
   b         2           5           2         # a,b
   c         5          15           2         # a,c (b has ended)
   d         8          13           3         # a,c,d
   e        12          20           3         # c,d,e (a has ended)
   f        21          30           1         # f (c,d,e have ended)
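
The definition in the question can be checked against a brute-force reference. The following is only a minimal Python sketch, not a BigQuery solution: it assumes a task is running at time t when start_time <= t < end_time (the end is exclusive, since b ends at 5 and is not counted when c starts at 5). The function name running_tasks_bruteforce is made up for this illustration.

```python
# Sample data from the question: (task_id, start_time, end_time).
tasks = [
    ("a", 1, 10), ("b", 2, 5), ("c", 5, 15),
    ("d", 8, 13), ("e", 12, 20), ("f", 21, 30),
]


def running_tasks_bruteforce(tasks):
    """Return {task_id: number of tasks running when that task starts}.

    Assumption: a task is running at time t when start_time <= t < end_time.
    """
    result = {}
    for task_id, start, _end in tasks:
        # Compare this task's start time against every task's interval.
        result[task_id] = sum(1 for _, s, e in tasks if s <= start < e)
    return result


counts = running_tasks_bruteforce(tasks)
print(counts)
```

This compares every task with every other task, so it is useful only as a correctness check on small inputs.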

3 answers:

Answer 0 (score: 2)

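-- Turn each task into two events: op = +1 at its start and op = -1 at its end.
-- A running sum of op over the events ordered by time gives the number of running tasks.
-- Ordering by tm,op puts an end (-1) before a start (+1) at the same time.
-- Keeping only the start events (op = 1) yields one row per task.
select      task_id,start_time,end_time,running_tasks
from       (select      task_id,tm,op,start_time,end_time
                       ,sum(op) over
                        (
                            order by    tm,op
                            rows        unbounded preceding
                        ) as running_tasks
            from       (select      task_id,start_time as tm,1 as op,start_time,end_time
                        from        tasks
                        union   all
                        select      task_id,end_time as tm,-1 as op,start_time,end_time
                        from        tasks
                        ) t
            )t
where       op = 1
;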
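
The idea behind the query above can be sketched outside SQL as a sweep over start and end events. This is only an illustrative Python sketch, not the answerer's code: the function name running_tasks_sweep is hypothetical, and the loop mimics a row-by-row running sum, so tasks that start at the same time would each see only the tasks sorted before them.

```python
# Sample data from the question: (task_id, start_time, end_time).
tasks = [
    ("a", 1, 10), ("b", 2, 5), ("c", 5, 15),
    ("d", 8, 13), ("e", 12, 20), ("f", 21, 30),
]


def running_tasks_sweep(tasks):
    """Return {task_id: running count at its start} using +1/-1 events."""
    events = []
    for task_id, start, end in tasks:
        events.append((start, 1, task_id))   # task starts: +1
        events.append((end, -1, task_id))    # task ends: -1
    # Sort by (time, op): at equal times an end (-1) sorts before a start (+1).
    events.sort(key=lambda ev: (ev[0], ev[1]))

    running = 0
    result = {}
    for _tm, op, task_id in events:
        running += op                # running sum over all events so far
        if op == 1:                  # keep the value only at start events
            result[task_id] = running
    return result


sweep_counts = running_tasks_sweep(tasks)
print(sweep_counts)
```

After sorting, the work is a single pass over 2n events, instead of comparing every task with every other task.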

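Answer 1 (score: 2)

You can use a correlated subquery, which in this case is effectively a self-join; no analytic functions are needed. After enabling standard SQL (uncheck "Use Legacy SQL" under "Show Options" in the UI), you can run this example: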

WITH tasks AS (
  SELECT
    task_id,
    start_time,
    end_time
  FROM UNNEST(ARRAY<STRUCT<task_id STRING, start_time INT64, end_time INT64>>[
    ('a', 1, 10),
    ('b', 2, 5),
    ('c', 5, 15),
    ('d', 8, 13),
    ('e', 12, 20),
    ('f', 21, 30)
  ])
)
SELECT
  *,
  (SELECT COUNT(*) FROM tasks t2
   WHERE t.start_time >= t2.start_time AND
   t.start_time < t2.end_time) AS running_tasks
FROM tasks t
ORDER BY task_id;

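Answer 2 (score: 2)

As Elliott mentioned, "it is generally more difficult to explain analytic functions to new users", and even established users do not always get them 100% right (although they come very close)! So, while Dudu Markovitz's answer is great, it is unfortunately still incorrect (at least as I understand the question). It goes wrong when multiple tasks start at the same start_time: those tasks get a wrong running_tasks result.

As an example, consider the following data: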

task_id  start_time  end_time
   a         1          10
   aa        1           2
   aaa       1           8
   b         2           5
   c         5          15
   d         8          13
   e        12          20
   f        21          30

I think you would expect the following result:

task_id  start_time  end_time  running_tasks
   a         1          10           3         # a,aa,aaa
   aa        1           2           3         # a,aa,aaa
   aaa       1           8           3         # a,aa,aaa
   b         2           5           3         # a,aaa,b (aa has ended)
   c         5          15           3         # a,aaa,c (b has ended)
   d         8          13           3         # a,c,d (aaa has ended)
   e        12          20           3         # c,d,e (a has ended)
   f        21          30           1         # f (c,d,e have ended)     

If you try Dudu's code, you get the following:

task_id  start_time  end_time  running_tasks
   a         1          10           1        
   aa        1           2           2        
   aaa       1           8           3        
   b         2           5           3        
   c         5          15           3        
   d         8          13           3        
   e        12          20           3        
   f        21          30           1        

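As you can see, the results for tasks a and aa are wrong. The reason is the use of ROWS UNBOUNDED PRECEDING instead of RANGE UNBOUNDED PRECEDING: a small but very important nuance! With ROWS, each of the three rows that share start_time 1 sums only the rows up to its own position, whereas with RANGE, rows that tie on the ORDER BY values are peers and are all included in the frame.

So the query below gives you the correct result: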

SELECT  task_id,start_time,end_time,running_tasks 
FROM  (
  SELECT  
    task_id, tm, op, start_time, end_time,
    SUM(op) OVER (ORDER BY  tm ,op RANGE UNBOUNDED PRECEDING) AS running_tasks 
  FROM  (
    SELECT  
      task_id, start_time AS tm, 1 AS op, start_time, end_time 
    FROM  tasks UNION  ALL 
    SELECT  
      task_id, end_time AS tm, -1 AS op, start_time, end_time 
    FROM  tasks 
  ) t 
)t 
WHERE  op = 1
ORDER BY start_time       

Quick summary:
ROWS UNBOUNDED PRECEDING sets the window frame based on row position, whereas
RANGE UNBOUNDED PRECEDING sets the window frame based on row values
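
The difference can be reproduced locally. The sketch below runs the same event query with both frame types against an in-memory SQLite database from Python. It assumes SQLite 3.25 or newer (the first version with window functions) and is meant only to illustrate the frame semantics, not BigQuery itself. The names TASKS, QUERY and running_tasks are made up for this example.

```python
import sqlite3

# Sample data with three tasks starting at the same time.
TASKS = [
    ("a", 1, 10), ("aa", 1, 2), ("aaa", 1, 8), ("b", 2, 5),
    ("c", 5, 15), ("d", 8, 13), ("e", 12, 20), ("f", 21, 30),
]

# {frame} is replaced by ROWS or RANGE; the rest of the query is unchanged.
QUERY = """
SELECT task_id, running_tasks
FROM (
  SELECT task_id, op,
         SUM(op) OVER (ORDER BY tm, op {frame} UNBOUNDED PRECEDING) AS running_tasks
  FROM (
    SELECT task_id, start_time AS tm, 1 AS op FROM tasks
    UNION ALL
    SELECT task_id, end_time AS tm, -1 AS op FROM tasks
  ) AS events
) AS summed
WHERE op = 1
"""


def running_tasks(frame):
    """Run the event query with the given frame type (assumes SQLite >= 3.25)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (task_id TEXT, start_time INTEGER, end_time INTEGER)"
    )
    conn.executemany("INSERT INTO tasks VALUES (?, ?, ?)", TASKS)
    result = dict(conn.execute(QUERY.format(frame=frame)).fetchall())
    conn.close()
    return result


by_rows = running_tasks("ROWS")
by_range = running_tasks("RANGE")
print("ROWS: ", by_rows)
print("RANGE:", by_range)
```

With RANGE, a, aa and aaa all get 3. With ROWS they get 1, 2 and 3 in whatever order the engine happens to place the tied rows, so which task looks wrong can differ between runs or engines.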

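Again, as Elliott mentioned, fully getting into this is much more complicated than the concept of a join, but it is worth it (as it is more efficient than joins). For more, see the documentation on the Window Frame Clause and on the use of ROWS vs RANGE.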