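How do I find the closest matching time from the same table?

Asked: 2016-08-11 16:44:48

Tags: sql google-bigquery

I am trying to match the start and end times of users watching TV shows, using a single table in Google BigQuery, but I'm not sure how to do it because I keep getting an error saying, "Table name cannot be resolved: dataset name is missing."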

Events table:

user_id  show_id   event_type  logtime
-------  --------  ----------  -----------------------
 john      123       start     2016-08-01 06:00:00 UTC
 john      123       start     2016-08-01 06:15:00 UTC
 john      123       end       2016-08-01 06:10:00 UTC
 john      123       end       2016-08-01 06:16:00 UTC

Desired result:

user_id  show_id   start_time                end_time
-------  --------  -----------------------   -----------------------
 john      123     2016-08-01 06:00:00 UTC   2016-08-01 06:10:00 UTC
 john      123     2016-08-01 06:15:00 UTC   2016-08-01 06:16:00 UTC

This is my current query:

SELECT user_id, show_id, st.logtime AS start_time, et.logtime AS end_time
  FROM 
    (SELECT user_id, show_id, logtime FROM events WHERE event_type = 'start') AS st 
  JOIN 
    (SELECT user_id, show_id, logtime FROM events WHERE event_type = 'end') AS et 
  ON 
    st.logtime = (SELECT min(logtime) FROM events WHERE event_type = 'end') 
      AND st.user_id = et.user_id AND st.show_id = et.show_id

Edit: Mikhail's answer seems to work best after I checked it against a few examples, but I don't know how to extend the logic below to handle consecutive instances of the same event_type:

SELECT 
  user_id, show_id,  
  logtime AS start_time,
  next_logtime AS end_time
FROM (
  SELECT 
    user_id, show_id, event_type, logtime,
    LEAD(logtime) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime,
    LEAD(event_type) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_event_type
  FROM events 
)
WHERE event_type = 'start'
AND next_event_type = 'end'

In that case, I would want to keep the earliest start time, 09:20, and the earliest end time, 09:24 (I think that makes sense...).

3 answers:

Answer 0 (score: 1)

Try the following:

SELECT 
  user_id, show_id,  
  logtime AS start_time,
  next_logtime AS end_time
FROM (
  SELECT 
    user_id, show_id, event_type, logtime,
    LEAD(logtime) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime
  FROM events 
)
WHERE event_type = 'start'  
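To see what the LEAD query above returns, here is a minimal sketch that runs the same query shape against the sample rows from the question. It assumes an in-memory SQLite database (version 3.25 or later, for window functions) as a stand-in for BigQuery, with timestamps stored as ISO-formatted text; neither assumption comes from the original thread.

```python
import sqlite3

# In-memory stand-in for the BigQuery table (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, show_id INTEGER, event_type TEXT, logtime TEXT)"
)
# Sample rows taken from the question's events table.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("john", 123, "start", "2016-08-01 06:00:00"),
        ("john", 123, "start", "2016-08-01 06:15:00"),
        ("john", 123, "end", "2016-08-01 06:10:00"),
        ("john", 123, "end", "2016-08-01 06:16:00"),
    ],
)

# Pair every start with the time of the next event for the same user and show.
rows = conn.execute(
    """
    SELECT user_id, show_id, logtime AS start_time, next_logtime AS end_time
    FROM (
      SELECT user_id, show_id, event_type, logtime,
             LEAD(logtime) OVER (PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime
      FROM events
    )
    WHERE event_type = 'start'
    ORDER BY start_time
    """
).fetchall()

for row in rows:
    print(row)
```

On cleanly alternating data this yields one row per start, matching the desired result in the question.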
  

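"Unfortunately, the data is very dirty, so some events may have a start time but no end time, or vice versa."

The example below ignores starts that have no matching end, and vice versa. You can adjust it to whatever logic you have in mind.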

SELECT 
  user_id, show_id,  
  logtime AS start_time,
  next_logtime AS end_time
FROM (
  SELECT 
    user_id, show_id, event_type, logtime,
    LEAD(logtime) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime,
    LEAD(event_type) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_event_type
  FROM events 
)
WHERE event_type = 'start'
AND next_event_type = 'end' 
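A small sketch of how the extra next_event_type filter behaves on dirty data. The rows below are made up for illustration (they are not from the question), and SQLite 3.25+ is again assumed as a local stand-in for BigQuery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, show_id INTEGER, event_type TEXT, logtime TEXT)"
)
# Hypothetical dirty data: a start with no end, a stray end, and a trailing start.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("john", 123, "start", "2016-08-01 06:00:00"),  # followed by another start
        ("john", 123, "start", "2016-08-01 06:05:00"),
        ("john", 123, "end", "2016-08-01 06:10:00"),
        ("john", 123, "end", "2016-08-01 06:12:00"),    # no start of its own
        ("john", 123, "start", "2016-08-01 06:15:00"),  # never ended
    ],
)

# Keep a start only when the very next event is an end.
rows = conn.execute(
    """
    SELECT user_id, show_id, logtime AS start_time, next_logtime AS end_time
    FROM (
      SELECT user_id, show_id, event_type, logtime,
             LEAD(logtime) OVER (PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime,
             LEAD(event_type) OVER (PARTITION BY user_id, show_id ORDER BY logtime) AS next_event_type
      FROM events
    )
    WHERE event_type = 'start'
      AND next_event_type = 'end'
    ORDER BY start_time
    """
).fetchall()

for row in rows:
    print(row)
```

Only the 06:05 start survives here: the 06:00 start is followed by another start, and the 06:15 start has no following event, so its next_event_type is NULL and the comparison filters it out.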
  

"I would want to keep the earliest start time, 09:20, and the earliest end time"

SELECT 
  user_id, show_id, 
  MIN(start_time) AS start_time,
  MAX(end_time) AS end_time
FROM (
  SELECT 
    user_id, show_id,  
    logtime AS start_time,
    next_logtime AS end_time,
    SUM(IF(event_type <> next_event_type, 1, 0)) OVER(PARTITION BY user_id, show_id ORDER BY logtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS grp
  FROM (
    SELECT 
      user_id, show_id, event_type, logtime,
      LEAD(logtime) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime,
      LEAD(event_type) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_event_type
    FROM events 
    FROM events 
  )
  WHERE event_type = 'start'
)
GROUP BY user_id, show_id, grp
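The grouping trick is easier to follow on a concrete run. This sketch assumes SQLite 3.25+ as a local stand-in and uses invented rows: only the 09:20 and 09:24 times echo the question, the date and the other rows are hypothetical. The boolean sum is written as a CASE expression so it runs in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, show_id INTEGER, event_type TEXT, logtime TEXT)"
)
# Hypothetical rows: two consecutive starts, then an end, then a clean pair.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("john", 123, "start", "2016-08-01 09:20:00"),
        ("john", 123, "start", "2016-08-01 09:22:00"),
        ("john", 123, "end", "2016-08-01 09:24:00"),
        ("john", 123, "start", "2016-08-01 09:30:00"),
        ("john", 123, "end", "2016-08-01 09:35:00"),
    ],
)

# grp counts the start-to-end transitions from the current start row onward,
# so consecutive starts that share the same following end get the same grp.
rows = conn.execute(
    """
    SELECT user_id, show_id, MIN(start_time) AS start_time, MAX(end_time) AS end_time
    FROM (
      SELECT user_id, show_id, logtime AS start_time, next_logtime AS end_time,
             SUM(CASE WHEN event_type <> next_event_type THEN 1 ELSE 0 END)
               OVER (PARTITION BY user_id, show_id ORDER BY logtime
                     ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS grp
      FROM (
        SELECT user_id, show_id, event_type, logtime,
               LEAD(logtime) OVER (PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime,
               LEAD(event_type) OVER (PARTITION BY user_id, show_id ORDER BY logtime) AS next_event_type
        FROM events
      )
      WHERE event_type = 'start'
    )
    GROUP BY user_id, show_id, grp
    ORDER BY start_time
    """
).fetchall()

for row in rows:
    print(row)
```

The two starts at 09:20 and 09:22 collapse into one row that runs from 09:20 to 09:24. MAX(end_time) works because, within a group, the largest next_logtime belongs to the last start, whose next event is the first end.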

Answer 1 (score: 0)

If your data really is properly aligned, you can enumerate the start and end times and use that number for aggregation:

select user_id, show_id,
       max(case when event_type = 'start' then logtime end) as logtime_start,
       max(case when event_type = 'end' then logtime end) as logtime_end
from (select e.*,
             row_number() over (partition by user_id, show_id, event_type order by logtime) as seqnum
      from events e
     ) e
group by user_id, show_id, seqnum;

This works for the data in your question. As long as the events are correctly paired, it should work fine.
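As a quick check of the enumeration approach, the sketch below runs the row_number query on the question's sample rows. SQLite 3.25+ stands in for BigQuery here, which is an assumption of this example, and an ORDER BY is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, show_id INTEGER, event_type TEXT, logtime TEXT)"
)
# Sample rows taken from the question's events table.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("john", 123, "start", "2016-08-01 06:00:00"),
        ("john", 123, "start", "2016-08-01 06:15:00"),
        ("john", 123, "end", "2016-08-01 06:10:00"),
        ("john", 123, "end", "2016-08-01 06:16:00"),
    ],
)

# Number starts and ends separately, then pair the n-th start with the n-th end.
rows = conn.execute(
    """
    select user_id, show_id,
           max(case when event_type = 'start' then logtime end) as logtime_start,
           max(case when event_type = 'end' then logtime end) as logtime_end
    from (select e.*,
                 row_number() over (partition by user_id, show_id, event_type
                                    order by logtime) as seqnum
          from events e
         ) e
    group by user_id, show_id, seqnum
    order by seqnum
    """
).fetchall()

for row in rows:
    print(row)
```

Because pairing is purely positional, a single unmatched start or end shifts the numbering, and later rows can then combine a start with the wrong end.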

Answer 2 (score: 0)

SELECT st.user_id, st.show_id, st.logtime AS start_time, MIN(et.logtime) AS end_time
FROM 
(SELECT user_id, show_id, logtime FROM events WHERE event_type = 'start') AS st 
CROSS JOIN 
(SELECT user_id, show_id, logtime FROM events WHERE event_type = 'end') AS et 
WHERE st.logtime < et.logtime AND st.user_id = et.user_id AND st.show_id = et.show_id
GROUP BY st.user_id, st.show_id, st.logtime
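  1. This generates the Cartesian product of the two subqueries: the start logtimes and the end logtimes.
  2. It keeps only the rows where the start logtime < the end logtime.
  3. It groups the rows that share the same start logtime and takes the minimum end logtime.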
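The three steps above can be sketched end to end on the question's sample rows. This assumes an in-memory SQLite table as a stand-in for BigQuery, and it uses the logtime column name from the question; both are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, show_id INTEGER, event_type TEXT, logtime TEXT)"
)
# Sample rows taken from the question's events table.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("john", 123, "start", "2016-08-01 06:00:00"),
        ("john", 123, "start", "2016-08-01 06:15:00"),
        ("john", 123, "end", "2016-08-01 06:10:00"),
        ("john", 123, "end", "2016-08-01 06:16:00"),
    ],
)

# Step 1: cross join starts with ends. Step 2: keep ends later than the start.
# Step 3: per start, take the earliest remaining end.
rows = conn.execute(
    """
    SELECT st.user_id, st.show_id, st.logtime AS start_time, MIN(et.logtime) AS end_time
    FROM (SELECT user_id, show_id, logtime FROM events WHERE event_type = 'start') AS st
    CROSS JOIN
         (SELECT user_id, show_id, logtime FROM events WHERE event_type = 'end') AS et
    WHERE st.logtime < et.logtime
      AND st.user_id = et.user_id
      AND st.show_id = et.show_id
    GROUP BY st.user_id, st.show_id, st.logtime
    ORDER BY start_time
    """
).fetchall()

for row in rows:
    print(row)
```

The cross join grows with the product of start and end counts per user and show, so this form is best suited to small tables or to data that has already been narrowed down.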