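I have a list of IDs with dollar values broken down by day of week and hour of day (this was derived from timestamps, so I only have dayOfWeek and hourOfDay for a single week):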
Id | dayOfWeek | hourOfDay | dollars
1 1 1 0
1 1 2 0
1 1 3 0
1 1 4 0
1 1 5 6
1 1 6 5
1 1 7 7
1 1 8 18
1 1 9 13
1 1 10 19
1 1 11 18
1 1 12 13
1 1 13 19
1 1 14 10
1 1 15 16
1 1 16 15
1 1 17 17
1 1 18 18
1 1 19 13
1 1 20 0
1 1 21 0
1 1 22 0
1 1 23 0
1 2 1 0
1 2 2 0
1 2 3 0
1 2 4 0
1 2 5 16
1 2 6 15
1 2 7 27
1 2 8 11
1 2 9 13
1 2 10 11
1 2 11 18
1 2 12 14
1 2 13 14
1 2 14 10
1 2 15 16
1 2 16 15
1 2 17 17
1 2 18 18
1 2 19 13
1 2 20 10
1 2 21 22
1 2 22 0
1 2 23 0
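I want to find the IDs whose run of consecutive zeros at the end of the day is longer than average. I was thinking of using something like percent_rank() to find the "above average" cases, but I can't combine that with counting the consecutive runs of zeros for each Id.
Any help would be much appreciated, but if I'm not approaching this the right way, or should be looking in a different direction, please let me know as well. Thanks very much.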
Answer 0 (score: 3)
The following is for BigQuery Standard SQL:
#standardSQL
WITH outages AS (
  -- One row per run of consecutive zero-dollar hours: where it starts and how long it is
  SELECT
    id,
    MIN(dayOfWeek) AS dayOfWeek,
    MIN(hourOfDay) AS hourOfDay,
    COUNT(1) AS len
  FROM (
    SELECT
      id, seq,
      -- Start of the run, repeated on every row of that run
      FIRST_VALUE(dayOfWeek) OVER(win) AS dayOfWeek,
      FIRST_VALUE(hourOfDay) OVER(win) AS hourOfDay
    FROM (
      SELECT
        id, dayOfWeek, hourOfDay, dollars,
        -- Running count of non-zero rows; adjacent zero rows all share the same seq
        COUNTIF(dollars <> 0) OVER(PARTITION BY id ORDER BY dayOfWeek, hourOfDay) AS seq
      FROM `yourTable`
    )
    WHERE dollars = 0
    WINDOW win AS (PARTITION BY id, seq ORDER BY dayOfWeek, hourOfDay)
  )
  GROUP BY id, seq
),
averages AS (
  -- Average run length per id
  SELECT id, AVG(len) AS len
  FROM outages
  GROUP BY id
)
-- Keep only the runs that are longer than that id's average
SELECT o.*
FROM outages AS o JOIN averages AS a
ON o.id = a.id AND o.len > a.len
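You can test / play with this using the dummy data from the question, as below. Note that the inline data also contains a row for day 2, hour 0, which does not appear in the question's listing.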
#standardSQL
WITH yourTable AS (
SELECT * FROM UNNEST([STRUCT<id INT64, dayOfWeek INT64, hourOfDay INT64, dollars INT64>(1, 1, 1, 0),(1, 1, 2, 0),(1, 1, 3, 0),(1, 1, 4, 0),(1, 1, 5, 6),(1, 1, 6, 5),(1, 1, 7, 7),(1, 1, 8, 18),(1, 1, 9, 13),(1, 1, 10, 19),(1, 1, 11, 18),(1, 1, 12, 13),(1, 1, 13, 19),(1, 1, 14, 10),(1, 1, 15, 16),(1, 1, 16, 15),(1, 1, 17, 17),(1, 1, 18, 18),(1, 1, 19, 13),(1, 1, 20, 0),(1, 1, 21, 0),(1, 1, 22, 0),(1, 1, 23, 0),(1, 2, 0, 0),(1, 2, 1, 0),(1, 2, 2, 0),(1, 2, 3, 0),(1, 2, 4, 0),(1, 2, 5, 16),(1, 2, 6, 15),(1, 2, 7, 27),(1, 2, 8, 11),(1, 2, 9, 13),(1, 2, 10, 11),(1, 2, 11, 18),(1, 2, 12, 14),(1, 2, 13, 14),(1, 2, 14, 10),(1, 2, 15, 16),(1, 2, 16, 15),(1, 2, 17, 17),(1, 2, 18, 18),(1, 2, 19, 13),(1, 2, 20, 10),(1, 2, 21, 22),(1, 2, 22, 0),(1, 2, 23, 0)])
),
outages AS (
  SELECT
    id,
    MIN(dayOfWeek) AS dayOfWeek,
    MIN(hourOfDay) AS hourOfDay,
    COUNT(1) AS len
  FROM (
    SELECT
      id, seq,
      FIRST_VALUE(dayOfWeek) OVER(win) AS dayOfWeek,
      FIRST_VALUE(hourOfDay) OVER(win) AS hourOfDay
    FROM (
      SELECT
        id, dayOfWeek, hourOfDay, dollars,
        COUNTIF(dollars <> 0) OVER(PARTITION BY id ORDER BY dayOfWeek, hourOfDay) AS seq
      FROM `yourTable`
    )
    WHERE dollars = 0
    WINDOW win AS (PARTITION BY id, seq ORDER BY dayOfWeek, hourOfDay)
  )
  GROUP BY id, seq
),
averages AS (
  SELECT id, AVG(len) AS len
  FROM outages
  GROUP BY id
)
SELECT o.*
FROM outages AS o JOIN averages AS a
ON o.id = a.id AND o.len > a.len
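As you can see here, the outages subselect finds every run of consecutive zeros and computes its start and its length, producing the output below: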
id dayOfWeek hourOfDay len
1 1 1 4
1 1 20 9
1 2 22 2
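For readers who find the window-function version hard to follow, the grouping idea behind outages can be sketched outside SQL. The Python below is only an illustration, not part of the answer's query: the function name zero_runs and the shortened sample rows are made up for this sketch. It keeps a running count of non-zero rows per id (the role of COUNTIF(dollars <> 0) OVER (...)), so that adjacent zero rows end up with the same counter value and can be grouped into one run.

```python
from itertools import groupby

# Hypothetical, shortened sample: (id, dayOfWeek, hourOfDay, dollars)
rows = [
    (1, 1, 18, 18), (1, 1, 19, 13),
    (1, 1, 20, 0), (1, 1, 21, 0), (1, 1, 22, 0), (1, 1, 23, 0),
    (1, 2, 0, 0), (1, 2, 1, 0),
    (1, 2, 2, 16),
    (1, 2, 3, 0),
]

def zero_runs(rows):
    """Return (id, dayOfWeek, hourOfDay, len) for every run of zero-dollar rows."""
    runs = []
    # Sorting by (id, dayOfWeek, hourOfDay) mirrors PARTITION BY id ORDER BY dayOfWeek, hourOfDay
    for id_, group in groupby(sorted(rows), key=lambda r: r[0]):
        seq = 0  # running count of non-zero rows seen so far for this id
        tagged = []
        for _, day, hour, dollars in group:
            if dollars != 0:
                seq += 1
            else:
                # zero rows between two non-zero rows all carry the same seq
                tagged.append((seq, day, hour))
        # Each distinct seq among the zero rows is one run
        for _, members in groupby(tagged, key=lambda t: t[0]):
            members = list(members)
            _, day, hour = members[0]  # first row of the run = its start
            runs.append((id_, day, hour, len(members)))
    return runs

print(zero_runs(rows))  # [(1, 1, 20, 6), (1, 2, 3, 1)]
```

The run that starts at day 1, hour 20 carries over midnight into day 2 because the counter only changes when a non-zero row is seen, which is also how the SQL treats it.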
The final SELECT outputs only those rows from outages whose length is greater than the average length for that id (from the averages subselect):
id dayOfWeek hourOfDay len
1 1 20 9
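The filtering step can be checked by hand with the three run lengths shown in the outages output: the average is (4 + 9 + 2) / 3 = 5, and only the run of length 9 exceeds it. A minimal Python sketch of the same comparison follows; the function name above_average is hypothetical, and the tuples are simply the outages rows copied from the output above.

```python
from collections import defaultdict
from statistics import mean

# Rows copied from the outages output: (id, dayOfWeek, hourOfDay, len)
outages = [(1, 1, 1, 4), (1, 1, 20, 9), (1, 2, 22, 2)]

def above_average(outages):
    """Keep the runs whose length exceeds the average run length of their id."""
    lengths_by_id = defaultdict(list)
    for id_, _, _, length in outages:
        lengths_by_id[id_].append(length)
    # Same role as the averages subselect: one average per id
    avg = {id_: mean(lengths) for id_, lengths in lengths_by_id.items()}
    # Same role as the join condition o.len > a.len
    return [run for run in outages if run[3] > avg[run[0]]]

print(above_average(outages))  # [(1, 1, 20, 9)]
```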