So the table holds a chronological record of events, like this:
+-------------------------+--------+--------+------------+------------+
| Timestamp               | id     | event  | variable 1 | variable 2 |
+-------------------------+--------+--------+------------+------------+
| 2019-05-17 00:00:00.000 | abc123 | event1 | variable1  | null       |
| 2019-05-17 00:00:10.000 | abc123 | event2 | null       | variable2  |
| 2019-05-17 00:00:15.000 | abc123 | event3 | null       | null       |
| 2019-05-17 00:05:00.000 | abc123 | event1 | variable1  | null       |
| 2019-05-17 00:05:10.000 | abc123 | event4 | null       | null       |
| 2019-05-17 00:05:15.000 | abc123 | event3 | null       | null       |
+-------------------------+--------+--------+------------+------------+
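The requirement is to count how many times a specific sequence of events occurs, for example event1 immediately followed by event2, immediately followed by event3. So, for the example above, the code would return: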
+--------+----------------+
| id     | sequence_count |
+--------+----------------+
| abc123 | 1              |
+--------+----------------+
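The event1 -> event2 -> event3 sequence occurs once in the dataset for user abc123; the event1 -> event4 -> event3 sequence is not counted. The variable used to break down the count could also be switched, giving the result: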
+------------+----------------+
| variable 1 | sequence_count |
+------------+----------------+
| variable1  | 1              |
+------------+----------------+
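For the purposes of this query, the timestamp variable should be treated as ordinal, not cardinal (only the order of events matters, not the time between them). Honestly, I don't know where to start. If someone can provide a basis for this kind of query, I should be able to build on it to extract the other insights I want from the data.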
Answer 0: (score: 1)
You can use the LEAD() analytic function, as in:
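with
  x as (
    select
      id,
      event,
      -- partition by id so a sequence never spans two users
      lead(event) over(partition by id order by timestamp) as next_event,
      lead(event, 2) over(partition by id order by timestamp) as next_next_event
    from t
  )
select id, count(*) as sequence_count
from x
where event = 'event1'
  and next_event = 'event2'
  and next_next_event = 'event3'
group by id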
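Added:
I'm not entirely sure about the additional question you raised in the comments, but it seems to me that you want to group by the variable on the initial event. If so, you can do this: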
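with
  x as (
    select
      event,
      variable_1,
      -- partition by id so a sequence never spans two users
      lead(event) over(partition by id order by timestamp) as next_event,
      lead(event, 2) over(partition by id order by timestamp) as next_next_event
    from t
  )
-- variable_1 is taken from the row that starts the sequence (event1)
select variable_1, count(*) as sequence_count
from x
where event = 'event1'
  and next_event = 'event2'
  and next_next_event = 'event3'
group by variable_1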
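A quick way to sanity-check the LEAD() approach without a BigQuery project is to run it against the six sample rows from the question in SQLite. This is only a sketch under some assumptions: SQLite stands in for BigQuery (window functions need SQLite 3.25 or newer), the table name t is taken from the answer, count_sequences is a made-up helper, and PARTITION BY id assumes that a sequence should not span two users.

```python
import sqlite3

# Sample rows copied from the question:
# (timestamp, id, event, variable_1, variable_2)
ROWS = [
    ("2019-05-17 00:00:00.000", "abc123", "event1", "variable1", None),
    ("2019-05-17 00:00:10.000", "abc123", "event2", None, "variable2"),
    ("2019-05-17 00:00:15.000", "abc123", "event3", None, None),
    ("2019-05-17 00:05:00.000", "abc123", "event1", "variable1", None),
    ("2019-05-17 00:05:10.000", "abc123", "event4", None, None),
    ("2019-05-17 00:05:15.000", "abc123", "event3", None, None),
]

QUERY = """
WITH x AS (
  SELECT
    id,
    variable_1,
    event,
    LEAD(event)    OVER (PARTITION BY id ORDER BY timestamp) AS next_event,
    LEAD(event, 2) OVER (PARTITION BY id ORDER BY timestamp) AS next_next_event
  FROM t
)
SELECT {key}, COUNT(*) AS sequence_count
FROM x
WHERE event = 'event1'
  AND next_event = 'event2'
  AND next_next_event = 'event3'
GROUP BY {key}
"""


def count_sequences(rows, key="id"):
    """Run the LEAD() query in an in-memory SQLite database, grouped by `key`."""
    # `key` is interpolated into the SQL text, so only allow known columns.
    if key not in ("id", "variable_1"):
        raise ValueError(f"unsupported grouping column: {key}")
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE t (timestamp TEXT, id TEXT, event TEXT, "
            "variable_1 TEXT, variable_2 TEXT)"
        )
        conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", rows)
        return conn.execute(QUERY.format(key=key)).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(count_sequences(ROWS, "id"))          # [('abc123', 1)]
    print(count_sequences(ROWS, "variable_1"))  # [('variable1', 1)]
```

Note that an id with no matching sequence does not appear in the result at all, because the WHERE clause filters its rows out before grouping.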
Answer 1: (score: 0)
I don't know BigQuery, but here are some ideas. You need to know the start event, e.g. 'event1'.
-- Written in MySQL 8+ syntax (GROUP_CONCAT is MySQL's string aggregate)
WITH cteSt (rid, id, timestamp) AS (
  -- Number every occurrence of the start event per id, in time order
  SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp), id, timestamp
  FROM dataset
  WHERE event = 'event1'  -- start event
),
cteRange (id, start_ts, end_ts) AS (
  -- Pair each start with the next start for the same id; the next start's
  -- timestamp closes the range (the last range stays open until now)
  SELECT s.id, s.timestamp, COALESCE(e.timestamp, CURRENT_TIMESTAMP)
  FROM cteSt s
  LEFT JOIN cteSt e
    ON s.id = e.id
   AND s.rid + 1 = e.rid
),
cte_Events (id, start_ts, event_sequence) AS (
  -- Concatenate the events of each range in timestamp order
  SELECT r.id, r.start_ts,
         GROUP_CONCAT(d.event ORDER BY d.timestamp SEPARATOR ',')
  FROM cteRange r
  INNER JOIN dataset d
    ON r.id = d.id
   AND d.timestamp >= r.start_ts
   AND d.timestamp <  r.end_ts  -- half-open, so the next start event is excluded
  GROUP BY r.id, r.start_ts
)
-- Count the occurrences of each event sequence
SELECT id, event_sequence, COUNT(*) AS occurrences
FROM cte_Events
WHERE event_sequence = 'event1,event2,event3'  -- your sequence; omit WHERE to count every sequence
GROUP BY id, event_sequence
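To make the idea behind this query easier to follow, here is a rough pure-Python sketch of the same segmentation: each id's events are cut into segments that begin at a start event and end just before the next one, and identical segments are then counted. The function name segment_counts and the (timestamp, id, event) row shape are assumptions for illustration only. Like the query, it only counts a segment when the whole stretch between two start events equals the target sequence.

```python
from collections import Counter, defaultdict

# Sample rows from the question, reduced to (timestamp, id, event).
ROWS = [
    ("2019-05-17 00:00:00.000", "abc123", "event1"),
    ("2019-05-17 00:00:10.000", "abc123", "event2"),
    ("2019-05-17 00:00:15.000", "abc123", "event3"),
    ("2019-05-17 00:05:00.000", "abc123", "event1"),
    ("2019-05-17 00:05:10.000", "abc123", "event4"),
    ("2019-05-17 00:05:15.000", "abc123", "event3"),
]


def segment_counts(rows, start_event="event1"):
    """Count comma-joined event segments per id.

    A segment starts at each `start_event` and ends just before the next one.
    Timestamps are assumed to be ISO-formatted strings, which sort correctly
    as plain text.
    """
    by_id = defaultdict(list)
    for _ts, id_, event in sorted(rows):  # sorted by timestamp first
        by_id[id_].append(event)

    counts = Counter()
    for id_, events in by_id.items():
        segment = None  # events before the first start event are ignored
        for event in events:
            if event == start_event:
                if segment is not None:
                    counts[(id_, ",".join(segment))] += 1
                segment = [event]
            elif segment is not None:
                segment.append(event)
        if segment is not None:  # close the last, still-open segment
            counts[(id_, ",".join(segment))] += 1
    return counts


if __name__ == "__main__":
    for (id_, sequence), n in segment_counts(ROWS).items():
        print(id_, sequence, n)
```

Looking up counts[("abc123", "event1,event2,event3")] on the sample data gives 1, which matches the result the question asks for.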
Answer 2: (score: 0)
Below is for BigQuery Standard SQL:
#standardSQL
SELECT id,
  ARRAY_LENGTH(
    REGEXP_EXTRACT_ALL(
      CONCAT(',', STRING_AGG(event ORDER BY Timestamp)),
      ',event1,event2,event3')
  ) AS sequence_count
FROM `project.dataset.table`
GROUP BY id
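The string-aggregation trick can be tried locally with Python's re module, which also returns non-overlapping matches. The sketch below is an approximation, not the BigQuery query itself: regex_sequence_counts is a hypothetical helper, and the trailing \b word boundary is an addition that is not in the answer. Without it, the pattern would also match when event3 is merely a prefix of a longer event name (say, a hypothetical event30).

```python
import re
from collections import defaultdict

# Sample rows from the question, reduced to (timestamp, id, event).
ROWS = [
    ("2019-05-17 00:00:00.000", "abc123", "event1"),
    ("2019-05-17 00:00:10.000", "abc123", "event2"),
    ("2019-05-17 00:00:15.000", "abc123", "event3"),
    ("2019-05-17 00:05:00.000", "abc123", "event1"),
    ("2019-05-17 00:05:10.000", "abc123", "event4"),
    ("2019-05-17 00:05:15.000", "abc123", "event3"),
]


def regex_sequence_counts(rows, sequence=("event1", "event2", "event3")):
    """Count occurrences of `sequence` per id via string aggregation + regex."""
    by_id = defaultdict(list)
    for _ts, id_, event in sorted(rows):  # ISO timestamps sort as text
        by_id[id_].append(event)

    # The leading comma anchors the match at the start of an event name;
    # the trailing \b stops 'event3' from matching inside e.g. 'event30'.
    pattern = re.compile("," + re.escape(",".join(sequence)) + r"\b")
    return {
        id_: len(pattern.findall("," + ",".join(events)))
        for id_, events in by_id.items()
    }


if __name__ == "__main__":
    print(regex_sequence_counts(ROWS))  # {'abc123': 1}
```

Unlike the LEAD() approach, this one reports every id, including those with a count of 0, because the filtering happens inside the aggregate rather than in a WHERE clause.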