Counting occurrences of a sequence of events spread across multiple rows

Time: 2019-05-17 13:35:38

Tags: sql google-cloud-platform google-bigquery

I have a table that holds a chronological record of events, like this:

+-------------------------+--------+--------+------------+------------+
|        Timestamp        |   id   | event  | variable 1 | variable 2 |
+-------------------------+--------+--------+------------+------------+
| 2019-05-17 00:00:00.000 | abc123 | event1 | variable1  | null       |
| 2019-05-17 00:00:10.000 | abc123 | event2 | null       | variable2  |
| 2019-05-17 00:00:15.000 | abc123 | event3 | null       | null       |
| 2019-05-17 00:05:00.000 | abc123 | event1 | variable1  | null       |
| 2019-05-17 00:05:10.000 | abc123 | event4 | null       | null       |
| 2019-05-17 00:05:15.000 | abc123 | event3 | null       | null       |
+-------------------------+--------+--------+------------+------------+

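The requirement is to count how many times a specific sequence of events occurs, for example event1 followed immediately by event2, followed immediately by event3. For the example above, the code would return: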

+--------+----------------+
|   id   | sequence_count |
+--------+----------------+
| abc123 |              1 |
+--------+----------------+

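The event1 -> event2 -> event3 sequence occurs once in the dataset for user abc123; the event1 -> event4 -> event3 sequence is not counted. The variable used to slice the count can also be switched, to give the result: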

+------------+----------------+
| variable 1 | sequence_count |
+------------+----------------+
| variable1  |              1 |
+------------+----------------+

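For the purposes of this query, the timestamp should be treated as ordinal rather than cardinal: only the order of the events matters, not the time elapsed between them. Honestly, I don't know where to start. If someone can provide a basis for this kind of query, I should be able to build on it to extract the other insights I want from the data.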

3 Answers:

Answer 0: (score: 1)

You can use the LEAD() analytic function, for example:

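with
x as (
  select
    id,
    event,
    -- partition by id so a sequence never spans two different users
    lead(event) over(partition by id order by timestamp) as next_event,
    lead(event, 2) over(partition by id order by timestamp) as next_next_event
  from t
)
select id, count(*) as sequence_count
from x
where event = 'event1'
  and next_event = 'event2'
  and next_next_event = 'event3'
group by id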

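Added:

I'm not quite sure about the additional question you asked in the comments, but it looks to me like you want to group by the variable of the initial event. If so, you can do: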

with
x as (
  select
    event,
    variable_1,
    -- partition by id so a sequence never spans two different users
    lead(event) over(partition by id order by timestamp) as next_event,
    lead(event, 2) over(partition by id order by timestamp) as next_next_event
  from t
)
select variable_1, count(*) as sequence_count
from x
where event = 'event1'
  and next_event = 'event2'
  and next_next_event = 'event3'
group by variable_1
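
To sanity-check the LEAD() approach without a BigQuery project, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. SQLite is only a stand-in for BigQuery here, the table name t and the column name ts are assumptions made for the example, and window functions need SQLite 3.25 or later.

```python
import sqlite3

# Sample rows from the question: (ts, id, event, variable_1, variable_2).
ROWS = [
    ("2019-05-17 00:00:00.000", "abc123", "event1", "variable1", None),
    ("2019-05-17 00:00:10.000", "abc123", "event2", None, "variable2"),
    ("2019-05-17 00:00:15.000", "abc123", "event3", None, None),
    ("2019-05-17 00:05:00.000", "abc123", "event1", "variable1", None),
    ("2019-05-17 00:05:10.000", "abc123", "event4", None, None),
    ("2019-05-17 00:05:15.000", "abc123", "event3", None, None),
]

# Same shape as the query above, with the window partitioned by id.
QUERY = """
with x as (
  select
    id,
    event,
    lead(event)    over (partition by id order by ts) as next_event,
    lead(event, 2) over (partition by id order by ts) as next_next_event
  from t
)
select id, count(*) as sequence_count
from x
where event = 'event1'
  and next_event = 'event2'
  and next_next_event = 'event3'
group by id
"""


def count_sequences(rows):
    """Load rows into an in-memory SQLite table and run the LEAD() query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (ts text, id text, event text, "
        "variable_1 text, variable_2 text)"
    )
    conn.executemany("insert into t values (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(count_sequences(ROWS))
```

On the sample data only the first event1 is followed by event2 and then event3, so the query returns a single row for abc123 with a count of 1.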

Answer 1: (score: 0)

I don't know BigQuery, but here are some ideas. You need to know the start event, for example 'event1'.

WITH cteSt (rid,id,timestamp)
AS
(
  -- Number every occurrence of the start event per id, in time order
  SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp), id, timestamp
  FROM dataset
  WHERE event='event1' -- start event
),
cteRange(id,start_ts,end_ts)
AS
(
  -- The next start event's timestamp closes the range;
  -- the last range stays open until the current time
  SELECT s.id,s.timestamp,COALESCE(e.timestamp,CURRENT_TIMESTAMP)
  FROM cteSt s
  LEFT JOIN cteSt e
  ON s.id=e.id
  AND s.rid+1=e.rid
),
cte_Events(id, start_ts, event_sequence)
AS
(
  -- Event sequence ordered by timestamp. GROUP_CONCAT is MySQL syntax;
  -- in BigQuery use STRING_AGG(d.event, ',' ORDER BY d.timestamp)
  SELECT r.id,r.start_ts, GROUP_CONCAT(d.event ORDER BY d.timestamp SEPARATOR ',')
  FROM cteRange r
  INNER JOIN dataset d
  ON r.id=d.id
  -- half-open range, so the next start event is not pulled into this sequence
  AND d.timestamp >= r.start_ts AND d.timestamp < r.end_ts
  GROUP BY r.id,r.start_ts
)
-- get the occurrences for each event sequence
SELECT id,event_sequence,COUNT(*) AS occurrences
FROM cte_Events
WHERE event_sequence='YourSequence' -- or drop the WHERE to count every sequence
GROUP BY id,event_sequence
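
The segmentation idea above can be sketched in plain Python to see what it produces on the sample data. This is only an illustration under assumptions: the function name segment_counts and the (timestamp, id, event) row layout are made up for the example, and each segment runs from one start event up to, but not including, the next start event for the same id.

```python
from collections import Counter

# Sample rows from the question, reduced to (timestamp, id, event).
ROWS = [
    ("2019-05-17 00:00:00.000", "abc123", "event1"),
    ("2019-05-17 00:00:10.000", "abc123", "event2"),
    ("2019-05-17 00:00:15.000", "abc123", "event3"),
    ("2019-05-17 00:05:00.000", "abc123", "event1"),
    ("2019-05-17 00:05:10.000", "abc123", "event4"),
    ("2019-05-17 00:05:15.000", "abc123", "event3"),
]


def segment_counts(rows, start_event="event1"):
    """Count (id, comma-joined event sequence) pairs, one per segment."""
    counts = Counter()
    open_segments = {}  # id -> events seen since the last start event
    for _, id_, event in sorted(rows):  # ISO timestamps sort chronologically
        if event == start_event:
            # A new start event closes the previous segment for this id.
            if id_ in open_segments:
                counts[(id_, ",".join(open_segments[id_]))] += 1
            open_segments[id_] = [event]
        elif id_ in open_segments:
            open_segments[id_].append(event)
        # Events before the first start event are ignored.
    # Close whatever segments are still open at the end of the data.
    for id_, events in open_segments.items():
        counts[(id_, ",".join(events))] += 1
    return counts


for key, n in segment_counts(ROWS).items():
    print(key, n)
```

Note that this approach only counts a segment when the whole segment equals the target sequence, so event1, event2, event3, event5 would be reported as a different sequence rather than as a match.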

Answer 2: (score: 0)

Below is for BigQuery Standard SQL:

#standardSQL
SELECT id, 
  ARRAY_LENGTH(
    REGEXP_EXTRACT_ALL(
      CONCAT(',', STRING_AGG(event ORDER BY Timestamp)), 
      ',event1,event2,event3')
  ) AS sequence_count
FROM `project.dataset.table`
GROUP BY id
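
One caveat with matching on an aggregated string: the pattern ',event1,event2,event3' would also match a stream that ends in event30, because nothing anchors the end of the last name. The Python sketch below illustrates the same aggregate-then-match idea with that edge closed. It is an illustration, not a drop-in replacement: the name count_by_id and the row layout are assumptions, and it relies on a regex lookahead, which BigQuery's RE2 engine does not support.

```python
import re
from collections import defaultdict


def count_by_id(rows, sequence):
    """rows: iterable of (timestamp, id, event); sequence: list of event names.

    Returns {id: number of non-overlapping occurrences of the sequence}.
    """
    events = defaultdict(list)
    for _, id_, event in sorted(rows):  # order each id's events by timestamp
        events[id_].append(event)
    # Leading comma anchors the first name; the lookahead for a trailing
    # comma anchors the last name without consuming the delimiter.
    pattern = re.compile("," + re.escape(",".join(sequence)) + "(?=,)")
    return {
        id_: len(pattern.findall("," + ",".join(evs) + ","))
        for id_, evs in events.items()
    }


rows = [
    ("2019-05-17 00:00:00.000", "abc123", "event1"),
    ("2019-05-17 00:00:10.000", "abc123", "event2"),
    ("2019-05-17 00:00:15.000", "abc123", "event3"),
    ("2019-05-17 00:05:00.000", "abc123", "event1"),
    ("2019-05-17 00:05:10.000", "abc123", "event4"),
    ("2019-05-17 00:05:15.000", "abc123", "event3"),
]
print(count_by_id(rows, ["event1", "event2", "event3"]))
```

Like the regex in the query, this counts non-overlapping matches, which makes no difference for event1, event2, event3 but would for a sequence that can overlap itself.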