Retrieve unknown values based on a sequence of values in non-schema-specific columns

Time: 2019-05-31 13:13:29

Tags: sql google-cloud-platform google-bigquery

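I want to return, and then operate on, event values and their associated time values, but only when a specific sequence of events occurs. Here is a simplified example table: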

+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+
|   id   |   event1   | time1 |   event2    | time2 |   event3    | time3 |   event4    | time4 |   event5    | time5 |
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+
| abc123 | firstevent | 10:00 | secondevent | 10:01 | thirdevent  | 10:02 | fourthevent | 10:03 | fifthevent  | 10:04 |
| abc123 | thirdevent | 10:10 | secondevent | 10:11 | thirdevent  | 10:12 | firstevent  | 10:13 | secondevent | 10:14 |
| def456 | thirdevent | 10:20 | firstevent  | 10:21 | secondevent | 10:22 | thirdevent  | 10:24 | fifthevent  | 10:25 |
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+

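For this table, we want to retrieve the times at which the following specific sequence of events occurs: firstevent, secondevent, thirdevent, followed by a final event with any non-null value. The relevant entries to return are shown below: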

+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+------------+-------+
|   id   |   event1   | time1 |   event2    | time2 |   event3    | time3 |   event4    | time4 |   event5   | time5 |
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+------------+-------+
| abc123 | firstevent | 10:00 | secondevent | 10:01 | thirdevent  | 10:02 | fourthevent | 10:03 | null       | null  |
| null   | null       | null  | null        | null  | null        | null  | null        | null  | null       | null  |
| def456 | null       | null  | firstevent  | 10:21 | secondevent | 10:22 | thirdevent  | 10:24 | fifthevent | 10:26 |
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+------------+-------+

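As shown above, the sequence can start in any column: the two results begin in the event1 and event2 columns respectively, so the solution should be column-independent and support n columns. The values can then be aggregated by the final non-null event that occurs in the sequence after the 3 fixed events, to give information like the following: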

+-------------+-------------------------------+
| FinalEvent  | AverageTimeBetweenFinalEvents |
+-------------+-------------------------------+
| fourthevent | 1:00                          |
| fifthevent  | 2:00                          |
+-------------+-------------------------------+

1 Answer:

Answer 0 (score: 0)

Below is for BigQuery Standard SQL:

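#standardSQL
-- The ordered sequence of events to look for
WITH search_events AS (
  SELECT ['firstevent', 'secondevent', 'thirdevent'] search
), temp AS (
  -- FinalEvent is the word that immediately follows the searched sequence
  SELECT *, REGEXP_EXTRACT(events, CONCAT(search, r',(\w*)')) FinalEvent
  FROM (
    -- Collect the times into an array; join events and search terms into comma-separated strings
    SELECT id, [time1, time2, time3, time4, time5] times,
      (SELECT STRING_AGG(event) FROM UNNEST([event1, event2, event3, event4, event5]) event) events,
      (SELECT STRING_AGG(search) FROM UNNEST(search) search) search
    FROM `project.dataset.table`, search_events
  )
)
SELECT FinalEvent,
  -- Commas before the match give the column where the sequence starts;
  -- adding 3 (the length of the searched sequence) lands on the final event's time
  times[SAFE_OFFSET(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(REGEXP_EXTRACT(events, CONCAT(r'(.*?)', search, ',', FinalEvent )), ',')) + 3)] time
FROM temp
-- Keep only rows where the sequence was found and is followed by an event
WHERE IFNULL(FinalEvent, '') != ''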

If applied to the sample data from your question, the result is:

Row FinalEvent  time     
1   fourthevent 10:03    
2   fifthevent  10:25    

So, as you can see, all the final events and their respective times are extracted.
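
For readers who want to trace the offset arithmetic outside BigQuery, here is a minimal Python sketch of the same idea, not a replacement for the query. It joins the events into one comma-separated string, finds the searched sequence followed by one more event, and counts the commas before the match to locate the matching time. The names ROWS, SEARCH and final_event are made up for this illustration; the data is copied from the sample table in the question.

```python
import re

# Sample rows from the question: (id, [event1..event5], [time1..time5]).
ROWS = [
    ("abc123", ["firstevent", "secondevent", "thirdevent", "fourthevent", "fifthevent"],
     ["10:00", "10:01", "10:02", "10:03", "10:04"]),
    ("abc123", ["thirdevent", "secondevent", "thirdevent", "firstevent", "secondevent"],
     ["10:10", "10:11", "10:12", "10:13", "10:14"]),
    ("def456", ["thirdevent", "firstevent", "secondevent", "thirdevent", "fifthevent"],
     ["10:20", "10:21", "10:22", "10:24", "10:25"]),
]
SEARCH = ["firstevent", "secondevent", "thirdevent"]


def final_event(events, times, search=SEARCH):
    """Return (final_event, time) for the first occurrence of `search`, or None."""
    joined = ",".join(events)    # plays the role of STRING_AGG(event)
    pattern = ",".join(search)   # plays the role of STRING_AGG(search)
    # \w+ requires a non-empty final event, so no separate empty-string filter is needed.
    match = re.search(re.escape(pattern) + r",(\w+)", joined)
    if match is None:
        return None
    # Commas before the match = number of columns that precede the sequence;
    # adding len(search) moves the index onto the final event.
    offset = joined[:match.start()].count(",") + len(search)
    return match.group(1), times[offset]


results = [r for r in (final_event(ev, tm) for _, ev, tm in ROWS) if r is not None]
print(results)
```

Unlike the query, which hard-codes + 3, the sketch uses len(search), so a longer or shorter sequence needs no other change.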
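Now you can do whatever analysis you need from here. I am not sure about the logic behind AverageTimeBetweenFinalEvents, so I am leaving that to you, especially since I think the main focus of the question was extracting those final events.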
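
One possible reading of AverageTimeBetweenFinalEvents, and it is only an assumption, is the average gap between thirdevent and the final event, grouped by final event. The Python sketch below illustrates that reading. MATCHES, minutes_between and average_gap are hypothetical names, and the times are read off the two matching rows of the sample input table.

```python
from collections import defaultdict
from datetime import datetime

# Assumed input shape: (final_event, time_of_thirdevent, time_of_final_event),
# taken from the two matching rows of the question's sample input.
MATCHES = [
    ("fourthevent", "10:02", "10:03"),
    ("fifthevent", "10:24", "10:25"),
]


def minutes_between(start, end):
    """Minutes from `start` to `end`, both 'HH:MM' strings on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


def average_gap(matches):
    """Average gap in minutes between thirdevent and the final event, per final event."""
    gaps = defaultdict(list)
    for event, third_time, final_time in matches:
        gaps[event].append(minutes_between(third_time, final_time))
    return {event: sum(values) / len(values) for event, values in gaps.items()}


averages = average_gap(MATCHES)
print(averages)
```

If the intended metric is something else (for example, the gap between consecutive final events for the same id), only the contents of MATCHES and the grouping key would need to change.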