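I am trying to match the start and end times of TV shows watched by users, using a single table in Google BigQuery, but I'm not sure how to do this because I keep getting an error saying, "Table name cannot be resolved: dataset name is missing."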
Events table:

user_id  show_id  event_type  logtime
-------  -------  ----------  -----------------------
john     123      start       2016-08-01 06:00:00 UTC
john     123      start       2016-08-01 06:15:00 UTC
john     123      end         2016-08-01 06:10:00 UTC
john     123      end         2016-08-01 06:16:00 UTC

Desired result:

user_id  show_id  start_time               end_time
-------  -------  -----------------------  -----------------------
john     123      2016-08-01 06:00:00 UTC  2016-08-01 06:10:00 UTC
john     123      2016-08-01 06:15:00 UTC  2016-08-01 06:16:00 UTC

This is my current query:

SELECT user_id, show_id, st.logtime AS start_time, et.logtime AS end_time
FROM
(SELECT user_id, show_id, logtime FROM events WHERE event_type = 'start') AS st
JOIN
(SELECT user_id, show_id, logtime FROM events WHERE event_type = 'end') AS et
ON
st.logtime = (SELECT min(logtime) FROM events WHERE event_type = 'end')
AND st.user_id = et.user_id AND st.show_id = et.show_id

Mikhail's answer seems to work best after checking it against several examples, but I don't know how to incorporate logic to handle consecutive instances of the same event_type. In that case, I want to keep the earliest start time, 09:20, and the earliest end time, 09:24 (I think that makes sense...).
Answer 0 (score: 1)
Try the following:
SELECT
user_id, show_id,
logtime AS start_time,
next_logtime AS end_time
FROM (
SELECT
user_id, show_id, event_type, logtime,
LEAD(logtime) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime
FROM events
)
WHERE event_type = 'start'
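"Unfortunately, the data is very dirty, so some events might have a start time but no end time, and vice versa."
The example below ignores starts that have no matching end, and vice versa. It can be adjusted to whatever logic you have in mind.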
SELECT
user_id, show_id,
logtime AS start_time,
next_logtime AS end_time
FROM (
SELECT
user_id, show_id, event_type, logtime,
LEAD(logtime) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime,
LEAD(event_type) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_event_type
FROM events
)
WHERE event_type = 'start'
AND next_event_type = 'end'
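To see how the LEAD-based filter treats an unmatched start, here is a minimal local sketch. It replays the same query in SQLite through Python's sqlite3 module instead of BigQuery, which is an assumption made purely so the example can be run anywhere (it needs SQLite 3.25 or later for window functions). The sample rows are the four from the question plus one hypothetical dangling start at 07:00.

```python
import sqlite3

# Hypothetical sample data: the question's rows plus one start with no end.
rows = [
    ("john", 123, "start", "2016-08-01 06:00:00"),
    ("john", 123, "end",   "2016-08-01 06:10:00"),
    ("john", 123, "start", "2016-08-01 06:15:00"),
    ("john", 123, "end",   "2016-08-01 06:16:00"),
    ("john", 123, "start", "2016-08-01 07:00:00"),  # dangling start
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, show_id INTEGER, event_type TEXT, logtime TEXT)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# Keep a start row only when the very next event for that user/show is an end.
query = """
SELECT user_id, show_id, logtime AS start_time, next_logtime AS end_time
FROM (
  SELECT user_id, show_id, event_type, logtime,
    LEAD(logtime) OVER (PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime,
    LEAD(event_type) OVER (PARTITION BY user_id, show_id ORDER BY logtime) AS next_event_type
  FROM events
)
WHERE event_type = 'start' AND next_event_type = 'end'
"""
sessions = sorted(conn.execute(query).fetchall())
for session in sessions:
    print(session)
```

The dangling 07:00 start has no following event, so its next_event_type is NULL and the row is filtered out.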
"I want to keep the earliest start time, 09:20, and the earliest end time"
SELECT
user_id, show_id,
MIN(start_time) AS start_time,
MAX(end_time) AS end_time
FROM (
SELECT
user_id, show_id,
logtime AS start_time,
next_logtime AS end_time,
SUM(IF(event_type <> next_event_type, 1, 0)) OVER(PARTITION BY user_id, show_id ORDER BY logtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) AS grp
FROM (
SELECT
user_id, show_id, event_type, logtime,
LEAD(logtime) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime,
LEAD(event_type) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_event_type
FROM events
)
WHERE event_type = 'start'
)
GROUP BY user_id, show_id, grp
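The grouping trick is easier to follow on concrete rows. The sketch below runs the same logic in SQLite via Python's sqlite3 (an assumption, chosen only so it can be executed locally; SQLite 3.25 or later is needed). The rows are hypothetical, since the question's own 09:20/09:24 example is not reproduced here: two starts in a row, two ends in a row, then one clean session. CASE WHEN is used in place of a boolean sum so the expression is portable.

```python
import sqlite3

# Hypothetical sample data: consecutive starts, consecutive ends, one clean session.
rows = [
    ("john", 123, "start", "2016-08-01 09:20:00"),
    ("john", 123, "start", "2016-08-01 09:22:00"),
    ("john", 123, "end",   "2016-08-01 09:24:00"),
    ("john", 123, "end",   "2016-08-01 09:25:00"),
    ("john", 123, "start", "2016-08-01 10:00:00"),
    ("john", 123, "end",   "2016-08-01 10:05:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, show_id INTEGER, event_type TEXT, logtime TEXT)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# grp counts the start-to-end transitions from the current start row onward,
# so a run of consecutive starts shares one grp value and collapses to one row.
query = """
SELECT user_id, show_id, MIN(start_time) AS start_time, MAX(end_time) AS end_time
FROM (
  SELECT user_id, show_id, logtime AS start_time, next_logtime AS end_time,
    SUM(CASE WHEN event_type <> next_event_type THEN 1 ELSE 0 END) OVER (
      PARTITION BY user_id, show_id ORDER BY logtime
      ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
    ) AS grp
  FROM (
    SELECT user_id, show_id, event_type, logtime,
      LEAD(logtime) OVER (PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime,
      LEAD(event_type) OVER (PARTITION BY user_id, show_id ORDER BY logtime) AS next_event_type
    FROM events
  )
  WHERE event_type = 'start'
)
GROUP BY user_id, show_id, grp
"""
merged = sorted(conn.execute(query).fetchall())
for session in merged:
    print(session)
```

MAX(end_time) still yields the earliest end here: within a group, the only candidates are the later starts in the run and the first end that follows them.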
Answer 1 (score: 0)
If your data really does line up, you can enumerate the start and end times and use that number for aggregation:
select user_id, show_id,
       max(case when event_type = 'start' then logtime end) as logtime_start,
       max(case when event_type = 'end' then logtime end) as logtime_end
from (select e.*,
             row_number() over (partition by user_id, show_id, event_type order by logtime) as seqnum
      from events e
     ) e
group by user_id, show_id, seqnum;
This works for the data in your question, and it should work in general as long as the events are correctly paired.
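As a quick check of the enumeration idea, the sketch below numbers the starts and the ends separately and pairs them by that number. It uses SQLite through Python's sqlite3 rather than BigQuery, which is an assumption for local testing only (SQLite 3.25 or later), and the rows are the four from the question.

```python
import sqlite3

# The four events from the question, without the UTC suffix.
rows = [
    ("john", 123, "start", "2016-08-01 06:00:00"),
    ("john", 123, "start", "2016-08-01 06:15:00"),
    ("john", 123, "end",   "2016-08-01 06:10:00"),
    ("john", 123, "end",   "2016-08-01 06:16:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, show_id INTEGER, event_type TEXT, logtime TEXT)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# The n-th start and the n-th end of each user/show share the same seqnum,
# so grouping by seqnum puts them on one output row.
query = """
SELECT user_id, show_id,
  MAX(CASE WHEN event_type = 'start' THEN logtime END) AS logtime_start,
  MAX(CASE WHEN event_type = 'end' THEN logtime END) AS logtime_end
FROM (
  SELECT e.*,
    ROW_NUMBER() OVER (PARTITION BY user_id, show_id, event_type ORDER BY logtime) AS seqnum
  FROM events e
) e
GROUP BY user_id, show_id, seqnum
"""
pairs = sorted(conn.execute(query).fetchall())
for pair in pairs:
    print(pair)
```

Note the limitation: a single unmatched start shifts the numbering, so every later start would be paired with the wrong end.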
Answer 2 (score: 0)
-- Pair each start with the earliest end that comes after it.
SELECT st.user_id, st.show_id, st.logtime AS start_time, MIN(et.logtime) AS end_time
FROM
(SELECT user_id, show_id, logtime FROM events WHERE event_type = 'start') AS st
JOIN
(SELECT user_id, show_id, logtime FROM events WHERE event_type = 'end') AS et
ON st.user_id = et.user_id AND st.show_id = et.show_id
WHERE st.logtime < et.logtime
GROUP BY st.user_id, st.show_id, st.logtime
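The approach here, matching each start with the earliest end that follows it, can be tried locally with the sketch below. It is written against SQLite via Python's sqlite3 rather than BigQuery (an assumption for testing only), joins with explicit AND conditions, and assumes the column is named logtime as in the question's table.

```python
import sqlite3

# The four events from the question, without the UTC suffix.
rows = [
    ("john", 123, "start", "2016-08-01 06:00:00"),
    ("john", 123, "start", "2016-08-01 06:15:00"),
    ("john", 123, "end",   "2016-08-01 06:10:00"),
    ("john", 123, "end",   "2016-08-01 06:16:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, show_id INTEGER, event_type TEXT, logtime TEXT)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# Join every start to every later end for the same user/show,
# then keep only the earliest such end per start.
query = """
SELECT st.user_id, st.show_id, st.logtime AS start_time, MIN(et.logtime) AS end_time
FROM (SELECT user_id, show_id, logtime FROM events WHERE event_type = 'start') AS st
JOIN (SELECT user_id, show_id, logtime FROM events WHERE event_type = 'end') AS et
  ON st.user_id = et.user_id AND st.show_id = et.show_id
WHERE st.logtime < et.logtime
GROUP BY st.user_id, st.show_id, st.logtime
"""
matched = sorted(conn.execute(query).fetchall())
for row in matched:
    print(row)
```

With consecutive starts, each of them would be matched to the same end, so this variant does not merge repeated start events on its own.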