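SQL (BigQuery) - Finding unusual sequences of values across time

Time: 2017-05-03 20:24:45

Tags: sql google-bigquery

I have a list of IDs with dollar values by day of week and hour of day (the data comes from timestamps, so I extracted dayOfWeek and hourOfDay for one week):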

Id | dayOfWeek | hourOfDay | dollars 
1       1           1           0
1       1           2           0
1       1           3           0
1       1           4           0
1       1           5           6
1       1           6           5
1       1           7           7
1       1           8           18
1       1           9           13
1       1           10          19
1       1           11          18
1       1           12          13
1       1           13          19
1       1           14          10
1       1           15          16
1       1           16          15
1       1           17          17
1       1           18          18
1       1           19          13
1       1           20          0
1       1           21          0
1       1           22          0
1       1           23          0
1       2           1           0
1       2           2           0
1       2           3           0
1       2           4           0
1       2           5           16
1       2           6           15
1       2           7           27
1       2           8           11
1       2           9           13
1       2           10          11
1       2           11          18
1       2           12          14
1       2           13          14
1       2           14          10
1       2           15          16
1       2           16          15
1       2           17          17
1       2           18          18
1       2           19          13
1       2           20          10
1       2           21          22
1       2           22          0
1       2           23          0

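I want to find the IDs that have a higher-than-average run of consecutive 0s at the end of the day. I was thinking of using something like percent_rank() to find the "above average" cases, but I can't work out how to combine that with counting the consecutive 0 rows for each Id.

Any help would be much appreciated. If I'm not approaching this the right way, or should be looking in a different direction, please let me know as well. Many thanks.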

1 answer:

Answer 0 (score: 3)

The following is for BigQuery Standard SQL:

   
#standardSQL
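WITH outages AS (
  -- One row per run of consecutive zero-dollar hours: where it starts and how long it is.
  SELECT 
    id, 
    MIN(dayOfWeek) AS dayOfWeek,
    MIN(hourOfDay) AS hourOfDay,
    COUNT(1) AS len
  FROM (
    -- Stamp every zero row with the day and hour of the first row in its run.
    SELECT 
      id, seq, 
      FIRST_VALUE(dayOfWeek) OVER(win) AS dayOfWeek,
      FIRST_VALUE(hourOfDay) OVER(win) AS hourOfDay
    FROM (
      -- seq is a running count of non-zero rows, so it stays constant within a run of zeros.
      SELECT 
        id, dayOfWeek, hourOfDay, dollars,
        COUNTIF(dollars <> 0) OVER(PARTITION BY id ORDER BY dayOfWeek, hourOfDay) AS seq
      FROM `yourTable`
    )
    WHERE dollars = 0
    WINDOW win AS (PARTITION BY id, seq ORDER BY dayOfWeek, hourOfDay)
  )
  GROUP BY id, seq
),
averages AS (
  -- Average run length per id.
  SELECT id, AVG(len) AS len
  FROM outages
  GROUP BY id
)
-- Keep only the runs that are longer than their id's average.
SELECT o.*
FROM outages AS o JOIN averages AS a 
ON o.id = a.id AND o.len > a.len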

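You can test and play with it using the dummy data from the question, as shown below. Note that the inline data also contains a row (1, 2, 0, 0) that the question's table does not list.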

#standardSQL
WITH yourTable AS (
  SELECT * FROM UNNEST([
    STRUCT<id INT64, dayOfWeek INT64, hourOfDay INT64, dollars INT64>(1, 1, 1, 0),
    (1, 1, 2, 0), (1, 1, 3, 0), (1, 1, 4, 0), (1, 1, 5, 6), (1, 1, 6, 5), (1, 1, 7, 7),
    (1, 1, 8, 18), (1, 1, 9, 13), (1, 1, 10, 19), (1, 1, 11, 18), (1, 1, 12, 13), (1, 1, 13, 19),
    (1, 1, 14, 10), (1, 1, 15, 16), (1, 1, 16, 15), (1, 1, 17, 17), (1, 1, 18, 18), (1, 1, 19, 13),
    (1, 1, 20, 0), (1, 1, 21, 0), (1, 1, 22, 0), (1, 1, 23, 0),
    (1, 2, 0, 0), (1, 2, 1, 0), (1, 2, 2, 0), (1, 2, 3, 0), (1, 2, 4, 0), (1, 2, 5, 16),
    (1, 2, 6, 15), (1, 2, 7, 27), (1, 2, 8, 11), (1, 2, 9, 13), (1, 2, 10, 11), (1, 2, 11, 18),
    (1, 2, 12, 14), (1, 2, 13, 14), (1, 2, 14, 10), (1, 2, 15, 16), (1, 2, 16, 15), (1, 2, 17, 17),
    (1, 2, 18, 18), (1, 2, 19, 13), (1, 2, 20, 10), (1, 2, 21, 22), (1, 2, 22, 0), (1, 2, 23, 0)
  ])
),
outages AS (
  SELECT 
    id, 
    MIN(dayOfWeek) AS dayOfWeek,
    MIN(hourOfDay) AS hourOfDay,
    COUNT(1) AS len
  FROM (
    SELECT 
      id, seq, 
      FIRST_VALUE(dayOfWeek) OVER(win) AS dayOfWeek,
      FIRST_VALUE(hourOfDay) OVER(win) AS hourOfDay
    FROM (
      SELECT 
        id, dayOfWeek, hourOfDay, dollars,
        COUNTIF(dollars <> 0) OVER(PARTITION BY id ORDER BY dayOfWeek, hourOfDay) AS seq 
      FROM `yourTable`
    )
    WHERE dollars = 0
    WINDOW win AS (PARTITION BY id, seq ORDER BY dayOfWeek, hourOfDay)
  )
  GROUP BY id, seq
),
averages AS (
  SELECT id, AVG(len) AS len
  FROM outages
  GROUP BY id
)
SELECT o.*
FROM outages AS o JOIN averages AS a 
ON o.id = a.id AND o.len > a.len  

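As you can see, the outages subselect finds every run of consecutive zeros and computes its start and its length, producing the output below. The second run crosses midnight: four zero hours at the end of day 1 followed by five at the start of day 2.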

id  dayOfWeek   hourOfDay   len  
1   1           1           4    
1   1           20          9    
1   2           22          2    
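
To see why the running count isolates each run, here is a minimal sketch on a hypothetical six-row sample (the `sample` CTE and its values are made up for illustration and are not taken from the question). It prints the seq label that the innermost subselect of outages assigns to each row.

```sql
#standardSQL
WITH sample AS (
  -- Hypothetical data: two zeros, two non-zero hours, then two more zeros.
  SELECT * FROM UNNEST([
    STRUCT<id INT64, dayOfWeek INT64, hourOfDay INT64, dollars INT64>(1, 1, 1, 0),
    (1, 1, 2, 0), (1, 1, 3, 6), (1, 1, 4, 5), (1, 1, 5, 0), (1, 1, 6, 0)
  ])
)
SELECT
  id, dayOfWeek, hourOfDay, dollars,
  -- Running count of non-zero rows seen so far for this id.
  COUNTIF(dollars <> 0) OVER (PARTITION BY id ORDER BY dayOfWeek, hourOfDay) AS seq
FROM sample
ORDER BY id, dayOfWeek, hourOfDay
-- hourOfDay  dollars  seq
-- 1          0        0
-- 2          0        0
-- 3          6        1
-- 4          5        2
-- 5          0        2
-- 6          0        2
```

Each zero run ends up with its own seq value (0 and 2 here). The non-zero row at hour 4 also carries seq = 2, which is why the query filters on dollars = 0 before grouping by id, seq.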

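The final SELECT outputs only those rows from outages whose length is greater than the average run length for that id (from the averages subselect). Here the average is (4 + 9 + 2) / 3 = 5, so only the run of length 9 qualifies: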

id  dayOfWeek   hourOfDay   len  
1   1           20          9
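
The question mentions percent_rank(). As a hedged alternative to the JOIN against averages, the same comparison can be sketched with window functions. This is not the answer's method: the outages CTE below simply hard-codes the three runs from the output above, and the avg_len and pct_rank column names are illustrative assumptions.

```sql
#standardSQL
WITH outages AS (
  -- The three zero runs listed in the outages output above.
  SELECT * FROM UNNEST([
    STRUCT<id INT64, dayOfWeek INT64, hourOfDay INT64, len INT64>(1, 1, 1, 4),
    (1, 1, 20, 9),
    (1, 2, 22, 2)
  ])
)
SELECT id, dayOfWeek, hourOfDay, len, avg_len, pct_rank
FROM (
  SELECT
    *,
    -- Per-id average run length: (4 + 9 + 2) / 3 = 5.0
    AVG(len) OVER (PARTITION BY id) AS avg_len,
    -- Relative rank of each run within its id: 0.0 for the shortest, 1.0 for the longest.
    PERCENT_RANK() OVER (PARTITION BY id ORDER BY len) AS pct_rank
  FROM outages
)
WHERE len > avg_len
```

On this data it returns the single run starting at day 1, hour 20. Filtering on pct_rank instead (for example pct_rank >= 0.9) would keep the longest runs per id rather than those above the mean; which threshold suits the data is a judgment call.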