Every malfunction of a device is logged. Each entry contains a customer_id, a device_id and a timestamp:
+-------------+-----------+---------------------+
| customer_id | device_id | timestamp           |
+-------------+-----------+---------------------+
| 1           | 1         | 2019-02-12T01:00:00 |
| 2           | 2         | 2019-02-12T01:00:00 |
| 1           | 1         | 2019-02-12T02:00:00 |
| 1           | 1         | 2019-02-12T03:00:00 |
+-------------+-----------+---------------------+
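The malfunction logs are collected once per hour. I am interested in the following information:
A device can have malfunctions across multiple hours, which may indicate a hardware defect. On the other hand, if a device's malfunctions do not span multiple hours, the cause may simply be incorrect use of the device.
The result should look like this: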
+-------------+-----------+-------+-------------+-----------------+------------+---------------------+
| customer_id | device_id | total | consecutive | non consecutive | day        | last_recording      |
+-------------+-----------+-------+-------------+-----------------+------------+---------------------+
| 1           | 1         | 3     | 1           | 2               | 2019-02-12 | 2019-02-12T03:00:00 |
| 2           | 2         | 1     | 0           | 1               | 2019-02-12 | 2019-02-12T01:00:00 |
+-------------+-----------+-------+-------------+-----------------+------------+---------------------+
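In the example above, device 1 reported a malfunction at 2019-02-12T02:00:00, which is considered "non-consecutive", and shortly afterwards another one at 2019-02-12T03:00:00, which is considered "consecutive".
I would like to create a query that produces this result. What I have tried: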
SELECT customer_id, device_id, COUNT(customer_id) AS count, FORMAT_TIMESTAMP("%Y-%m-%d", TIMESTAMP(timestamp)) as day
FROM `malfunctions`
GROUP BY day, customer_id, device_id
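This way I get the total number of malfunctions per customer and device per day. I suppose I have to use the LEAD operator to get the (non-)consecutive counts, but I am not sure how. Any ideas? The result should be "rolled up" by day.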
Answer 0 (score: 1)
Below is for BigQuery Standard SQL
#standardSQL
SELECT customer_id, device_id, day,
  SUM(batch_count) total,
  -- rows that belong to a run of two or more consecutive hours
  SUM(batch_count) - COUNTIF(batch_count = 1) consecutive,
  -- runs that consist of a single isolated row
  COUNTIF(batch_count = 1) non_consecutive,
  ARRAY_AGG(STRUCT(batch AS batch, batch_count AS batch_count, first_recording AS first_recording, last_recording AS last_recording)) details
FROM (
  -- one row per run (batch) of consecutive hourly malfunctions
  SELECT customer_id, device_id, day, batch,
    COUNT(1) batch_count,
    MIN(ts) first_recording,
    MAX(ts) last_recording
  FROM (
    -- the running count of gap flags numbers the batches within each day
    SELECT customer_id, device_id, ts, day,
      COUNTIF(gap) OVER(PARTITION BY customer_id, device_id, day ORDER BY ts) batch
    FROM (
      -- gap is TRUE for the first row of a day, or when more than one hour has passed since the previous row
      SELECT customer_id, device_id, ts, DATE(ts) day,
        IFNULL(TIMESTAMP_DIFF(ts, LAG(ts) OVER(PARTITION BY customer_id, device_id, DATE(ts) ORDER BY ts), HOUR), 777) > 1 gap
      FROM `project.dataset.malfunctions`
    )
  )
  GROUP BY customer_id, device_id, day, batch
)
GROUP BY customer_id, device_id, day
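To make the grouping logic of the query easier to follow, here is a minimal plain-Python sketch of the same steps: partition by customer, device and day, start a new batch whenever the previous row is missing or more than one hour older, then count batch sizes. It is only an illustration, not part of the original answer. The function name `summarize` and the tuple layout of the rows are assumptions made for this example, and the hour difference is approximated with integer division of seconds, which matches `TIMESTAMP_DIFF(..., HOUR)` only under the assumption that readings land on whole hours.

```python
from collections import defaultdict
from datetime import datetime


def summarize(rows):
    """rows: iterable of (customer_id, device_id, iso_timestamp) tuples (hypothetical layout)."""
    # Equivalent of PARTITION BY customer_id, device_id, DATE(ts).
    partitions = defaultdict(list)
    for customer_id, device_id, ts in rows:
        stamp = datetime.fromisoformat(ts)
        partitions[(customer_id, device_id, stamp.date())].append(stamp)

    result = {}
    for key, stamps in partitions.items():
        stamps.sort()  # ORDER BY ts
        batch_sizes = []
        previous = None
        for stamp in stamps:
            # gap: no previous row in this partition, or more than one hour since it
            gap = previous is None or (stamp - previous).total_seconds() // 3600 > 1
            if gap:
                batch_sizes.append(0)  # a gap opens a new batch
            batch_sizes[-1] += 1
            previous = stamp

        total = sum(batch_sizes)
        # Batches of size 1 are isolated malfunctions.
        non_consecutive = sum(1 for size in batch_sizes if size == 1)
        result[key] = {
            "total": total,
            "consecutive": total - non_consecutive,
            "non_consecutive": non_consecutive,
            "last_recording": stamps[-1],
        }
    return result


if __name__ == "__main__":
    question_rows = [
        (1, 1, "2019-02-12T01:00:00"),
        (2, 2, "2019-02-12T01:00:00"),
        (1, 1, "2019-02-12T02:00:00"),
        (1, 1, "2019-02-12T03:00:00"),
    ]
    for key, summary in sorted(summarize(question_rows).items()):
        print(key, summary)
```

On the four rows from the question, this grouping puts all three rows of device 1 into a single batch, so it yields consecutive = 3 and non_consecutive = 0 for that device. That differs from the expected table in the question, so the definition of "consecutive" may need adjusting to match what is actually wanted.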
You can test and play with the above using dummy data, as in the example below
#standardSQL
WITH `project.dataset.malfunctions` AS (
  SELECT 1 customer_id, 1 device_id, TIMESTAMP '2019-02-12T01:00:00' ts UNION ALL
  SELECT 1, 1, '2019-02-12T02:00:00' UNION ALL
  SELECT 1, 1, '2019-02-12T03:00:00' UNION ALL
  SELECT 1, 1, '2019-02-12T04:00:00' UNION ALL
  SELECT 1, 1, '2019-02-12T09:00:00' UNION ALL
  SELECT 1, 1, '2019-02-12T10:00:00' UNION ALL
  SELECT 1, 1, '2019-02-13T03:00:00' UNION ALL
  SELECT 2, 2, '2019-02-12T01:00:00'
)
SELECT customer_id, device_id, day,
  SUM(batch_count) total,
  SUM(batch_count) - COUNTIF(batch_count = 1) consecutive,
  COUNTIF(batch_count = 1) non_consecutive,
  ARRAY_AGG(STRUCT(batch AS batch, batch_count AS batch_count, first_recording AS first_recording, last_recording AS last_recording)) details
FROM (
  SELECT customer_id, device_id, day, batch,
    COUNT(1) batch_count,
    MIN(ts) first_recording,
    MAX(ts) last_recording
  FROM (
    SELECT customer_id, device_id, ts, day,
      COUNTIF(gap) OVER(PARTITION BY customer_id, device_id, day ORDER BY ts) batch
    FROM (
      SELECT customer_id, device_id, ts, DATE(ts) day,
        IFNULL(TIMESTAMP_DIFF(ts, LAG(ts) OVER(PARTITION BY customer_id, device_id, DATE(ts) ORDER BY ts), HOUR), 777) > 1 gap
      FROM `project.dataset.malfunctions`
    )
  )
  GROUP BY customer_id, device_id, day, batch
)
GROUP BY customer_id, device_id, day
-- ORDER BY customer_id, device_id, day
with result