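How do I return counts that are unique across multiple fields within a lookback period?

Time: 2018-09-30 18:47:18

Tags: sql google-bigquery

This is a sample of the dataset I have (about 10 TB):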

+----+------------+----------+----------------+--------------+
| id | date       | campaign | campaign_start | campaign_end |
+----+------------+----------+----------------+--------------+
| 1  | 2018-01-01 | 1        | 2018-01-01     | 2018-02-03   |
+----+------------+----------+----------------+--------------+
| 1  | 2018-02-01 | 2        | 2018-02-01     | 2018-02-03   |
+----+------------+----------+----------------+--------------+
| 1  | 2018-02-02 | 2        | 2018-02-01     | 2018-02-03   |
+----+------------+----------+----------------+--------------+
| 1  | 2018-02-03 | 2        | 2018-02-01     | 2018-02-03   |
+----+------------+----------+----------------+--------------+
| 2  | 2018-01-23 | 1        | 2018-01-01     | 2018-02-03   |
+----+------------+----------+----------------+--------------+
| 2  | 2018-02-03 | 2        | 2018-02-01     | 2018-02-03   |
+----+------------+----------+----------------+--------------+

What I want:

For each unique ID + campaign combination:

  1. Get how often the ID appears during the run of that specific campaign
  2. Get how often the ID appears during a variable lookback period (e.g. 3 months) before the campaign starts, i.e. "date >= campaign_start - 3 months"
  3. Get the earliest (first) and latest (last) date within that window

My desired output is:

+----+----------+--------------------+--------------------------+----------------+--------------+------------+------------+
| id | campaign | campaign_frequency | total_lookback_frequency | campaign_start | campaign_end | first_date | last_date  |
+----+----------+--------------------+--------------------------+----------------+--------------+------------+------------+
| 1  | 1        | 1                  | 1                        | 2018-01-01     | 2018-02-03   | 2018-01-01 | 2018-01-01 |
+----+----------+--------------------+--------------------------+----------------+--------------+------------+------------+
| 1  | 2        | 3                  | 4                        | 2018-02-01     | 2018-02-03   | 2018-01-01 | 2018-02-03 |
+----+----------+--------------------+--------------------------+----------------+--------------+------------+------------+
| 2  | 1        | 1                  | 1                        | 2018-01-01     | 2018-02-03   | 2018-01-23 | 2018-01-23 |
+----+----------+--------------------+--------------------------+----------------+--------------+------------+------------+
| 2  | 2        | 1                  | 2                        | 2018-02-01     | 2018-02-03   | 2018-01-23 | 2018-02-03 |
+----+----------+--------------------+--------------------------+----------------+--------------+------------+------------+
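One reading of the three requirements that reproduces the table above is: count the rows tagged with the campaign, add the same ID's rows (from any campaign) dated inside the lookback window just before campaign_start, and take the earliest and latest dates over both sets. The plain-Python sketch below only pins that logic down on the sample rows; it is not BigQuery code. The helper names `months_before` and `summarize` are made up for illustration, and the sketch assumes every campaign row is dated on or after its campaign_start.

```python
import calendar
from datetime import date

# Sample rows from the question: (id, date, campaign, campaign_start, campaign_end)
ROWS = [
    (1, date(2018, 1, 1), 1, date(2018, 1, 1), date(2018, 2, 3)),
    (1, date(2018, 2, 1), 2, date(2018, 2, 1), date(2018, 2, 3)),
    (1, date(2018, 2, 2), 2, date(2018, 2, 1), date(2018, 2, 3)),
    (1, date(2018, 2, 3), 2, date(2018, 2, 1), date(2018, 2, 3)),
    (2, date(2018, 1, 23), 1, date(2018, 1, 1), date(2018, 2, 3)),
    (2, date(2018, 2, 3), 2, date(2018, 2, 1), date(2018, 2, 3)),
]


def months_before(d, months):
    """Shift d back by whole months, clamping the day to the end of the month."""
    year, month0 = divmod(d.year * 12 + (d.month - 1) - months, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def summarize(rows, lookback_months=3):
    """Return one tuple per (id, campaign), in the column order of the table."""
    groups = {}
    for rid, d, campaign, start, end in rows:
        groups.setdefault((rid, campaign, start, end), []).append(d)

    result = []
    for (rid, campaign, start, end), campaign_dates in sorted(groups.items()):
        window_start = months_before(start, lookback_months)
        # Same id, any campaign, dated in [window_start, campaign_start).
        lookback_dates = [
            d for r, d, *_ in rows if r == rid and window_start <= d < start
        ]
        seen = campaign_dates + lookback_dates
        result.append((rid, campaign, len(campaign_dates), len(seen),
                       start, end, min(seen), max(seen)))
    return result


for row in summarize(ROWS):
    print(row)
```

Under this reading, total_lookback_frequency is campaign_frequency plus the lookback count, which is why it can never be smaller than campaign_frequency.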

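The problem I keep running into is that I can't get total_lookback_frequency to work: it always returns the same result as campaign_frequency (which is just count(id) grouped by id and campaign).

Here is what I have (it does not work):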

SELECT  
  id,
  campaign,
  min(date) as first_date,
  max(date) as end_date,
  count(id) as total_lookback_frequency,
WHERE
  date >= sub(date, INTERVAL 730 hour)
GROUP BY
  id,
  campaign,
  date

Can you help with this?

Thanks!

1 answer:

Answer 0: (score: 1)

The following is for BigQuery Standard SQL:

#standardSQL
SELECT 
  id,
  campaign,
  COUNT(1) campaign_frequency,
  (
    SELECT COUNT(1) 
    FROM `project.dataset.table` 
    WHERE id = t.id
    AND dt BETWEEN  DATE_SUB(t.campaign_start, INTERVAL 3 MONTH) AND DATE_SUB(t.campaign_start, INTERVAL 1 DAY)
  ) total_lookback_frequency,
  campaign_start,
  campaign_end,
  MIN(dt) AS first_date,
  MAX(dt) AS end_date
FROM `project.dataset.table` t
GROUP BY id, campaign, campaign_start, campaign_end
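The difference between the filter in the question and the correlated subquery above can be seen in a few lines of Python. This is only an illustration of the two predicates on the dates of id 1 from the sample data, not BigQuery code. The 730 hours are effectively treated as 30 days here, because Python's `date` arithmetic ignores the sub-day part of a `timedelta`.

```python
from datetime import date, timedelta

# Dates on which id 1 appears in the sample data.
dates = [date(2018, 1, 1), date(2018, 2, 1), date(2018, 2, 2), date(2018, 2, 3)]

# The question's filter compares each row's date with that same date minus
# an interval. That holds for every row, so nothing is ever filtered out.
self_filter = [d for d in dates if d >= d - timedelta(hours=730)]

# The subquery compares each row's date with bounds derived from the outer
# group's campaign_start (campaign 2 here), so only earlier rows match.
campaign_start = date(2018, 2, 1)
window_start = date(2017, 11, 1)  # campaign_start minus 3 months
window_end = campaign_start - timedelta(days=1)
lookback = [d for d in dates if window_start <= d <= window_end]

print(len(self_filter))  # 4: every row passes
print(len(lookback))     # 1: only 2018-01-01 falls in the lookback window
```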

You can test and experiment with the query above using the dummy data from the question, as follows:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, DATE '2018-01-01' dt, 1 campaign, DATE '2018-01-01' campaign_start, DATE '2018-02-03' campaign_end UNION ALL
  SELECT 1, '2018-02-01', 2, '2018-02-01', '2018-02-03' UNION ALL
  SELECT 1, '2018-02-02', 2, '2018-02-01', '2018-02-03' UNION ALL
  SELECT 1, '2018-02-03', 2, '2018-02-01', '2018-02-03' UNION ALL
  SELECT 2, '2018-01-23', 1, '2018-01-01', '2018-02-03' UNION ALL
  SELECT 2, '2018-02-03', 2, '2018-02-01', '2018-02-03' 
)
SELECT 
  id,
  campaign,
  COUNT(1) campaign_frequency,
  (
    SELECT COUNT(1) 
    FROM `project.dataset.table` 
    WHERE id = t.id
    AND dt BETWEEN  DATE_SUB(t.campaign_start, INTERVAL 3 MONTH) AND DATE_SUB(t.campaign_start, INTERVAL 1 DAY)
  ) total_lookback_frequency,
  campaign_start,
  campaign_end,
  MIN(dt) AS first_date,
  MAX(dt) AS end_date
FROM `project.dataset.table` t
GROUP BY id, campaign, campaign_start, campaign_end
-- ORDER BY id, campaign
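On the dummy data, the subquery counts only the rows dated before campaign_start (0, 1, 0, 1 for the four groups), and MIN(dt) and MAX(dt) cover only the campaign's own rows. The table in the question instead shows 1, 4, 1, 2 and dates that reach back into the lookback window. If those combined figures are what is needed, one possible variation is to pre-aggregate per campaign and then LEFT JOIN the lookback rows. The sketch below runs that idea on SQLite through Python's `sqlite3` module purely so it can be tried locally. It is not BigQuery syntax: the table name `t` is a placeholder, SQLite's `date(x, '-3 months')` stands in for `DATE_SUB(x, INTERVAL 3 MONTH)`, and the query assumes campaign rows are dated on or after campaign_start.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER, dt TEXT, campaign INTEGER,"
    " campaign_start TEXT, campaign_end TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2018-01-01", 1, "2018-01-01", "2018-02-03"),
        (1, "2018-02-01", 2, "2018-02-01", "2018-02-03"),
        (1, "2018-02-02", 2, "2018-02-01", "2018-02-03"),
        (1, "2018-02-03", 2, "2018-02-01", "2018-02-03"),
        (2, "2018-01-23", 1, "2018-01-01", "2018-02-03"),
        (2, "2018-02-03", 2, "2018-02-01", "2018-02-03"),
    ],
)

QUERY = """
WITH per_campaign AS (
  -- One row per (id, campaign) with the campaign's own count and date range.
  SELECT id, campaign, campaign_start, campaign_end,
         COUNT(*) AS campaign_frequency,
         MIN(dt)  AS first_dt,
         MAX(dt)  AS last_dt
  FROM t
  GROUP BY id, campaign, campaign_start, campaign_end
)
SELECT
  c.id,
  c.campaign,
  c.campaign_frequency,
  c.campaign_frequency + COUNT(p.dt) AS total_lookback_frequency,
  c.campaign_start,
  c.campaign_end,
  -- Lookback rows are earlier than campaign_start, so they win when present.
  COALESCE(MIN(p.dt), c.first_dt) AS first_date,
  c.last_dt AS last_date
FROM per_campaign AS c
LEFT JOIN t AS p
  ON p.id = c.id
 AND p.dt >= date(c.campaign_start, '-3 months')
 AND p.dt <  c.campaign_start
GROUP BY c.id, c.campaign, c.campaign_start, c.campaign_end,
         c.campaign_frequency, c.first_dt, c.last_dt
ORDER BY c.id, c.campaign
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The lookback length appears only in the join condition, so making it variable means changing that one expression.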