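BigQuery: Calculating timestamp differences between time-ordered rows within a group

Time: 2018-08-17 17:49:04

Tags: google-bigquery

Given a table like this, I want to calculate how long each state lasted before it changed to another state: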

id state timestamp
1  1     2018-08-17 10:40:00
1  2     2018-08-17 12:40:00
1  1     2018-08-17 14:40:00
2  1     2018-08-17 09:00:00
2  2     2018-08-17 12:00:00

The output I want is:

id state date       duration
1  1     2018-08-17 2 hours
1  2     2018-08-17 2 hours
1  1     2018-08-17 9 hours 20 minutes (until the end of the day in this case)
2  1     2018-08-17 3 hours
2  2     2018-08-17 12 hours (until the end of the day in this case)

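I'm not sure whether this is possible in SQL. I suspect I would have to write a UDF that aggregates state and timestamp (grouped by id and ordered by ts) and outputs an array of structs (id, state, date, and duration). That array could then be flattened.

1 answer:

Answer 0 (score: 3)

The following is for BigQuery Standard SQL. LEAD(ts) returns the timestamp of the next row for the same id. For the last row of each id it returns NULL, so IFNULL falls back to the number of minutes remaining until the end of that day (TIMESTAMP_TRUNC uses UTC unless a time zone is given).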

#standardSQL
SELECT id, state, 
  IFNULL(
    TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, MINUTE), 
    24*60 - TIMESTAMP_DIFF(ts, TIMESTAMP_TRUNC(ts, DAY), MINUTE)
  ) AS duration_minutes
FROM `project.dataset.table`

You can test and play with the above using the dummy data from your question:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 1 state, TIMESTAMP('2018-08-17 10:40:00') ts UNION ALL
  SELECT 1, 2, '2018-08-17 12:40:00' UNION ALL
  SELECT 1, 1, '2018-08-17 14:40:00' UNION ALL
  SELECT 2, 1, '2018-08-17 09:00:00' UNION ALL
  SELECT 2, 2, '2018-08-17 12:00:00' 
)
SELECT id, state, 
  IFNULL(
    TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, MINUTE), 
    24*60 - TIMESTAMP_DIFF(ts, TIMESTAMP_TRUNC(ts, DAY), MINUTE)
  ) AS duration_minutes
FROM `project.dataset.table`
-- ORDER BY id, ts  

The result is:

Row  id  state  duration_minutes
1    1   1      120
2    1   2      120
3    1   1      560
4    2   1      180
5    2   2      720
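
To double-check these numbers outside BigQuery, the same arithmetic can be sketched in plain Python. This is only an illustration under stated assumptions, not part of the original answer: the function name duration_minutes and the tuple layout (id, state, ts) are made up for this sketch, and timestamps are treated as naive UTC values.

```python
from datetime import datetime
from itertools import groupby

# Dummy data from the question as (id, state, ts) tuples (layout is an assumption).
rows = [
    (1, 1, datetime(2018, 8, 17, 10, 40)),
    (1, 2, datetime(2018, 8, 17, 12, 40)),
    (1, 1, datetime(2018, 8, 17, 14, 40)),
    (2, 1, datetime(2018, 8, 17, 9, 0)),
    (2, 2, datetime(2018, 8, 17, 12, 0)),
]

def duration_minutes(rows):
    """Mimic LEAD(ts) OVER (PARTITION BY id ORDER BY ts) with an end-of-day fallback."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))  # order by id, then ts
    for _, group in groupby(ordered, key=lambda r: r[0]):  # one partition per id
        group = list(group)
        for i, (id_, state, ts) in enumerate(group):
            if i + 1 < len(group):
                # A next row exists: whole minutes until the next timestamp.
                next_ts = group[i + 1][2]
                minutes = int((next_ts - ts).total_seconds() // 60)
            else:
                # Last row for this id: 24*60 minus the minutes elapsed since midnight.
                minutes = 24 * 60 - (ts.hour * 60 + ts.minute)
            result.append((id_, state, minutes))
    return result

durations = duration_minutes(rows)
for row in durations:
    print(row)
```

The last row of each id takes the fallback branch, which is where the 560 and 720 minute values come from.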

If you need the output formatted like the output shown in your question, use the following:

#standardSQL
SELECT id, state, ts, duration_minutes,
  FORMAT('%i hours %i minutes', DIV(duration_minutes, 60), MOD(duration_minutes, 60)) duration
FROM (
  SELECT id, state, ts,
    IFNULL(
      TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, MINUTE), 
      24*60 - TIMESTAMP_DIFF(ts, TIMESTAMP_TRUNC(ts, DAY), MINUTE)
    ) AS duration_minutes
  FROM `project.dataset.table`
)

In that case, your output will look like this:

Row id  state   ts                        duration_minutes  duration     
1   1   1       2018-08-17 10:40:00 UTC   120               2 hours 0 minutes    
2   1   2       2018-08-17 12:40:00 UTC   120               2 hours 0 minutes    
3   1   1       2018-08-17 14:40:00 UTC   560               9 hours 20 minutes   
4   2   1       2018-08-17 09:00:00 UTC   180               3 hours 0 minutes    
5   2   2       2018-08-17 12:00:00 UTC   720               12 hours 0 minutes   
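
The hours-and-minutes split is plain integer division and remainder. A minimal Python sketch of that step follows. The helper names format_duration and format_compact are hypothetical, and format_compact reflects an assumption about the question's desired output, which shows "2 hours" rather than "2 hours 0 minutes".

```python
def format_duration(duration_minutes: int) -> str:
    """Same split as DIV(duration_minutes, 60) and MOD(duration_minutes, 60)."""
    hours, minutes = divmod(duration_minutes, 60)
    return f"{hours} hours {minutes} minutes"

def format_compact(duration_minutes: int) -> str:
    """Variant (an assumption) that drops the minutes part when it is zero."""
    hours, minutes = divmod(duration_minutes, 60)
    if minutes == 0:
        return f"{hours} hours"
    return f"{hours} hours {minutes} minutes"

for m in (120, 560, 720):
    print(m, "|", format_duration(m), "|", format_compact(m))
```

In BigQuery itself, the compact variant would need a conditional around the FORMAT call. That is left out here because the original answer does not include it.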

Of course, you will most likely still need to adjust this to your specific case, but I think this gives you a good start.