I have a BigQuery table representing messages. Each message belongs to a conversation and has a date:
conversation date
1 2018-06-22 23:16:46.456 UTC
2 2018-06-05 00:07:12.178 UTC
1 2018-06-22 23:16:46.456 UTC
4 2018-06-05 00:07:12.178 UTC
3 2018-06-22 23:51:28.540 UTC
3 2018-06-23 00:02:59.285 UTC
4 2018-06-04 23:21:59.500 UTC
I need to get the average time spent in a conversation.
I use this query to get the time spent per conversation:
SELECT conversation, timestamp_diff(MAX(date), MIN(date), MINUTE) minutes
FROM `Message`
GROUP BY conversation
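However, since some conversations span several days, they must be split into smaller chunks whenever the gap between messages is greater than 1 hour: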
conversation date
2 2018-06-22 00:01:46.456 UTC # group 1
2 2018-06-22 00:07:12.178 UTC # group 1
2 2018-06-22 00:16:46.456 UTC # group 1
2 2018-06-22 01:07:42.178 UTC # group 1
there is a gap here
2 2018-06-22 12:51:28.540 UTC # group 2
2 2018-06-22 13:00:40.486 UTC # group 2
there is another gap here
2 2018-06-22 19:54:30.031 UTC # group 3
I think this can be done with analytic functions: https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts
But I don't know how to do it. Any help would be really appreciated.
Answer 0 (score: 4)
Below is for BigQuery Standard SQL.
"When the gap between messages is greater than 1 hour, they must be split into smaller chunks":
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 2 conversation, TIMESTAMP '2018-06-22 00:01:46.456 UTC' dt UNION ALL # group 1
  SELECT 2, '2018-06-22 00:07:12.178 UTC' UNION ALL # group 1
  SELECT 2, '2018-06-22 00:16:46.456 UTC' UNION ALL # group 1
  SELECT 2, '2018-06-22 01:07:42.178 UTC' UNION ALL # group 1
  SELECT 2, '2018-06-22 12:51:28.540 UTC' UNION ALL # group 2
  SELECT 2, '2018-06-22 13:00:40.486 UTC' UNION ALL # group 2
  SELECT 2, '2018-06-22 19:54:30.031 UTC' # group 3
), conversation_groups AS (
  SELECT
    conversation, dt,
    # running sum of the flags: goes up by 1 each time a new group starts
    SUM(flag) OVER(PARTITION BY conversation ORDER BY dt) conversation_group
  FROM (
    SELECT
      conversation, dt,
      # flag is 1 when at least one whole hour has passed since the previous message, else 0
      SIGN(IFNULL(TIMESTAMP_DIFF(dt, LAG(dt) OVER(PARTITION BY conversation ORDER BY dt), HOUR), 0)) flag
    FROM `project.dataset.table`
  )
)
SELECT *
FROM conversation_groups
ORDER BY conversation, dt
The result is:
Row conversation dt conversation_group
1 2 2018-06-22 00:01:46.456 UTC 0
2 2 2018-06-22 00:07:12.178 UTC 0
3 2 2018-06-22 00:16:46.456 UTC 0
4 2 2018-06-22 01:07:42.178 UTC 0
5 2 2018-06-22 12:51:28.540 UTC 1
6 2 2018-06-22 13:00:40.486 UTC 1
7 2 2018-06-22 19:54:30.031 UTC 2
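To sanity-check the grouping logic outside BigQuery, here is a minimal pure-Python sketch of the same idea: compare each message with the previous one in the same conversation and bump a running counter whenever the gap reaches the threshold. It is not BigQuery code, and the names assign_groups, max_gap and SAMPLE are hypothetical, introduced only for illustration. It assumes the same behavior as the query above, where a gap of one whole hour or more starts a new group.

```python
from datetime import datetime, timedelta


def parse(ts):
    """Parse a timestamp such as '2018-06-22 00:01:46.456' (UTC assumed)."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")


def assign_groups(rows, max_gap=timedelta(hours=1)):
    """Return (conversation, dt, group) tuples ordered by conversation and dt.

    A new group starts when the gap from the previous message of the same
    conversation is at least max_gap. This mirrors the flag plus running
    SUM in the query; max_gap is an illustrative parameter.
    """
    result = []
    previous = {}  # conversation -> (dt of last message, current group number)
    for conversation, dt in sorted(rows):
        if conversation in previous:
            last_dt, group = previous[conversation]
            if dt - last_dt >= max_gap:  # flag = 1
                group += 1
        else:
            group = 0  # first message of the conversation (LAG is NULL)
        previous[conversation] = (dt, group)
        result.append((conversation, dt, group))
    return result


SAMPLE = [
    (2, parse("2018-06-22 00:01:46.456")),
    (2, parse("2018-06-22 00:07:12.178")),
    (2, parse("2018-06-22 00:16:46.456")),
    (2, parse("2018-06-22 01:07:42.178")),
    (2, parse("2018-06-22 12:51:28.540")),
    (2, parse("2018-06-22 13:00:40.486")),
    (2, parse("2018-06-22 19:54:30.031")),
]

for conversation, dt, group in assign_groups(SAMPLE):
    print(conversation, dt, group)
```

On the sample rows this yields group numbers 0, 0, 0, 0, 1, 1, 2, matching the conversation_group column above. Passing a different max_gap (for example timedelta(minutes=30)) is one way to experiment with other thresholds.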
"I need to get the average time spent in a conversation":
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 2 conversation, TIMESTAMP '2018-06-22 00:01:46.456 UTC' dt UNION ALL # group 1
  SELECT 2, '2018-06-22 00:07:12.178 UTC' UNION ALL # group 1
  SELECT 2, '2018-06-22 00:16:46.456 UTC' UNION ALL # group 1
  SELECT 2, '2018-06-22 01:07:42.178 UTC' UNION ALL # group 1
  SELECT 2, '2018-06-22 12:51:28.540 UTC' UNION ALL # group 2
  SELECT 2, '2018-06-22 13:00:40.486 UTC' UNION ALL # group 2
  SELECT 2, '2018-06-22 19:54:30.031 UTC' # group 3
), conversation_groups AS (
  SELECT
    conversation, dt,
    # running sum of the flags: goes up by 1 each time a new group starts
    SUM(flag) OVER(PARTITION BY conversation ORDER BY dt) conversation_group
  FROM (
    SELECT
      conversation, dt,
      # flag is 1 when at least one whole hour has passed since the previous message, else 0
      SIGN(IFNULL(TIMESTAMP_DIFF(dt, LAG(dt) OVER(PARTITION BY conversation ORDER BY dt), HOUR), 0)) flag
    FROM `project.dataset.table`
  )
)
# zero-length groups (a single message) become NULL so that AVG ignores them
SELECT conversation, AVG(IF(duration = 0, NULL, duration)) avg_duration
FROM (
  SELECT
    conversation, conversation_group,
    # duration of each group in whole minutes
    TIMESTAMP_DIFF(MAX(dt), MIN(dt), MINUTE) duration
  FROM conversation_groups
  GROUP BY conversation, conversation_group
)
GROUP BY conversation
ORDER BY conversation
The result is:
Row conversation avg_duration
1 2 37.0
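The 37.0 can be cross-checked by hand with a small pure-Python sketch of the same two steps: compute the duration of each group in whole minutes, then average the non-zero durations. This is only an illustration under the same assumptions as the query (a gap of one whole hour or more starts a new group, zero-length groups are ignored). The names average_group_minutes, max_gap and SAMPLE are hypothetical.

```python
from datetime import datetime, timedelta


def parse(ts):
    """Parse a timestamp such as '2018-06-22 00:01:46.456' (UTC assumed)."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")


def average_group_minutes(timestamps, max_gap=timedelta(hours=1)):
    """Average duration, in whole minutes, of the groups of one conversation.

    Groups whose duration is zero (for example a single message) are left
    out of the average, like the IF(duration = 0, NULL, duration) step.
    Returns None when no group has a non-zero duration.
    """
    groups = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][-1] < max_gap:
            groups[-1].append(ts)  # still within the current group
        else:
            groups.append([ts])    # gap reached the threshold: new group
    # Whole minutes between the first and last message of each group.
    durations = [(g[-1] - g[0]) // timedelta(minutes=1) for g in groups]
    nonzero = [d for d in durations if d != 0]
    return sum(nonzero) / len(nonzero) if nonzero else None


SAMPLE = [
    parse("2018-06-22 00:01:46.456"),
    parse("2018-06-22 00:07:12.178"),
    parse("2018-06-22 00:16:46.456"),
    parse("2018-06-22 01:07:42.178"),
    parse("2018-06-22 12:51:28.540"),
    parse("2018-06-22 13:00:40.486"),
    parse("2018-06-22 19:54:30.031"),
]

print(average_group_minutes(SAMPLE))
```

For the sample rows the group durations are 65, 9 and 0 minutes. The zero is dropped, so the average is (65 + 9) / 2 = 37.0, in line with the query result.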
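Note: you can adjust the logic for computing the average to fit your specific needs. The approach above first computes the duration of each group and then averages those durations across the groups. If a group's duration is zero, it is replaced with NULL so that it does not affect the average. Durations are computed in MINUTE, but you can choose SECOND, and you can also choose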