Flattening an events table in SQL BigQuery

Time: 2018-08-31 10:26:27

Tags: sql google-bigquery

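I have an events table that contains 3 event types per campaign and per person. The 3 events are "Received Email", "Opened Email" and "Clicked Email". I would like to get the timestamp of each event for a person/campaign as new columns in the table. What is the best way to do this?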

Sample table data:

campaign_id     person_id     event_type     timestamp

1               1             Received Email 2018-01-01
1               1             Opened Email   2018-02-01
1               1             Clicked Email  2018-03-01
1               2             Received Email 2018-01-01
1               2             Opened Email   2018-02-01
1               2             Opened Email   2018-02-02

Example output:

    campaign_id     person_id     event_type     timestamp     receive_ts     open_ts     click_ts

    1               1             Received Email 2018-01-01    2018-01-01     2018-02-01  2018-03-01
    1               1             Opened Email   2018-02-01    2018-01-01     2018-02-01  2018-03-01
    1               1             Clicked Email  2018-03-01    2018-01-01     2018-02-01  2018-03-01
    1               2             Received Email 2018-01-01    2018-01-01     2018-02-01
    1               2             Opened Email   2018-02-01    2018-01-01     2018-02-01
    1               2             Opened Email   2018-02-02    2018-01-01     2018-02-01

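The only solution I have come up with is to join the table to itself three times on campaign_id and person_id, once for each event type, but the table has over 400 million rows, so that is clearly not efficient.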

Any suggestions are appreciated!

2 Answers:

Answer 0: (score: 3)

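You could try a pivot query here. For example, if you want the difference in minutes between receiving an email and opening it, for each person/campaign, you could try the following: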

SELECT
    campaign_id,
    person_id,
    TIMESTAMP_DIFF(
        MAX(CASE WHEN event_type = 'Opened Email' THEN timestamp END),
        MAX(CASE WHEN event_type = 'Received Email' THEN timestamp END),
        MINUTE) AS diff_in_minutes
FROM yourTable
GROUP BY
    campaign_id,
    person_id;

Note: this answer was written for the original question, which was later changed substantially.
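
The same conditional-aggregation idea can also be pointed at the revised question. The following is only a sketch, not part of the original answer: it assumes the `yourTable` and `timestamp` names used above, and it returns one row per campaign/person rather than keeping every event row. It uses MIN instead of MAX so that a repeated event (such as the two "Opened Email" rows for person 2) resolves to the earliest timestamp, which is what the example output in the question shows.

```sql
#standardSQL
-- Sketch: pivot the three event types into columns, one row per (campaign_id, person_id).
-- yourTable and the timestamp column are the placeholder names from the answer above.
SELECT
    campaign_id,
    person_id,
    -- MIN ignores NULLs, so each expression only sees rows of its own event type
    MIN(CASE WHEN event_type = 'Received Email' THEN `timestamp` END) AS receive_ts,
    MIN(CASE WHEN event_type = 'Opened Email' THEN `timestamp` END) AS open_ts,
    MIN(CASE WHEN event_type = 'Clicked Email' THEN `timestamp` END) AS click_ts
FROM yourTable
GROUP BY
    campaign_id,
    person_id;
```

If an event type never occurred for a campaign/person pair, the corresponding column comes back as NULL.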

Answer 1: (score: 1)

Below is for BigQuery Standard SQL. And no, you do not need three JOINs; you do not need any JOIN here at all.

#standardSQL
SELECT campaign_id, person_id, event_type, ts,
  FIRST_VALUE(IF(event_type='Received Email', ts, NULL) IGNORE NULLS) OVER(win) receive_ts,
  FIRST_VALUE(IF(event_type='Opened Email', ts, NULL) IGNORE NULLS) OVER(win) open_ts,
  FIRST_VALUE(IF(event_type='Clicked Email', ts, NULL) IGNORE NULLS) OVER(win) click_ts
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY campaign_id, person_id ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

You can test and play with the above using the dummy data from the question:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 campaign_id, 1 person_id, 'Received Email' event_type, '2018-01-01' ts UNION ALL
  SELECT 1, 1, 'Opened Email', '2018-02-01' UNION ALL
  SELECT 1, 1, 'Clicked Email', '2018-03-01' UNION ALL
  SELECT 1, 2, 'Received Email', '2018-01-01' UNION ALL
  SELECT 1, 2, 'Opened Email', '2018-02-01' UNION ALL
  SELECT 1, 2, 'Opened Email', '2018-02-02' 
)
SELECT campaign_id, person_id, event_type, ts,
  FIRST_VALUE(IF(event_type='Received Email', ts, NULL) IGNORE NULLS) OVER(win) receive_ts,
  FIRST_VALUE(IF(event_type='Opened Email', ts, NULL) IGNORE NULLS) OVER(win) open_ts,
  FIRST_VALUE(IF(event_type='Clicked Email', ts, NULL) IGNORE NULLS) OVER(win) click_ts
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY campaign_id, person_id ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-- ORDER BY campaign_id, person_id, ts   

The result should be

Row campaign_id person_id   event_type      ts          receive_ts  open_ts     click_ts     
1   1           1           Received Email  2018-01-01  2018-01-01  2018-02-01  2018-03-01   
2   1           1           Opened Email    2018-02-01  2018-01-01  2018-02-01  2018-03-01   
3   1           1           Clicked Email   2018-03-01  2018-01-01  2018-02-01  2018-03-01   
4   1           2           Received Email  2018-01-01  2018-01-01  2018-02-01  null     
5   1           2           Opened Email    2018-02-01  2018-01-01  2018-02-01  null     
6   1           2           Opened Email    2018-02-02  2018-01-01  2018-02-01  null
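
As a possible variation, not taken from the answer itself: since the earliest timestamp per event type is what is wanted, an aggregate window function can stand in for FIRST_VALUE. This is a sketch under the same assumptions as above (the placeholder table `project.dataset.table` and a column named ts). With no ORDER BY in the window, the frame defaults to the whole partition, so the explicit ROWS BETWEEN clause is not needed.

```sql
#standardSQL
-- Sketch: same output shape as the query above, using MIN as a window aggregate.
-- `project.dataset.table` and ts are placeholder names carried over from the answer.
SELECT campaign_id, person_id, event_type, ts,
  -- MIN skips NULLs, so each column picks the earliest ts of one event type
  MIN(IF(event_type = 'Received Email', ts, NULL)) OVER (win) AS receive_ts,
  MIN(IF(event_type = 'Opened Email', ts, NULL)) OVER (win) AS open_ts,
  MIN(IF(event_type = 'Clicked Email', ts, NULL)) OVER (win) AS click_ts
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY campaign_id, person_id)
```

On the dummy data this should produce the same six rows as the result above, with click_ts left null for person 2.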