How to combine multiple SELECTs into a single SELECT by a common column in (BigQuery) SQL?

Time: 2017-09-20 13:04:58

Tags: sql google-bigquery

I have multiple tables in BigQuery, and therefore multiple SQL statements that each give me "the number of X per day". For example:

SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as installs
FROM database.table1
GROUP BY day
ORDER BY day ASC

Which would give the result:

| day        | installs |
-------------------------
| 2017-01-01 | 11       |
| 2017-01-02 | 22       |
etc

Another statement:

SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as uninstalls
FROM database.table2
GROUP BY day
ORDER BY day ASC

Which would give the result:

| day        | uninstalls |
---------------------------
| 2017-01-02 | 22         |
| 2017-01-03 | 33         |
etc

Another statement:

SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as cases
FROM database.table3
GROUP BY day
ORDER BY day ASC

Which would give the result:

| day        | cases |
----------------------
| 2017-01-01 | 11    |
| 2017-01-03 | 33    |
etc

etc

Now I need to combine all these into a single SELECT statement that gives the following results:

| day        | installs | uninstalls | cases |
----------------------------------------------
| 2017-01-01 | 11       | 0          | 11    |
| 2017-01-02 | 22       | 22         | 0     |
| 2017-01-03 | 0        | 33         | 33    |
etc

Is this even possible?

Or what's the closest SQL-statement I can write that would give me a similar result?

Any feedback is appreciated!

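3 answers:

Answer 0: (score: 2)

Here is a self-contained example that might help you get started. It uses two dummy tables, InstallEvents and UninstallEvents, which contain the timestamps of the respective actions. It creates a common table expression named StartAndEnd that computes the minimum and maximum dates of these events in order to decide which dates to aggregate over, then unions the contents of InstallEvents and UninstallEvents and counts the events for each day.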

WITH InstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
),
StartAndEnd AS (
  SELECT MIN(DATE(timestamp)) AS min_date, MAX(DATE(timestamp)) AS max_date
  FROM (
    SELECT * FROM InstallEvents UNION ALL
    SELECT * FROM UninstallEvents
  )
)
SELECT
  day,
  COUNTIF(is_install AND DATE(timestamp) = day) AS installs,
  COUNTIF(NOT is_install AND DATE(timestamp) = day) AS uninstalls
FROM (
  SELECT *, true AS is_install
  FROM InstallEvents UNION ALL
  SELECT *, false
  FROM UninstallEvents
)
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(
    (SELECT min_date FROM StartAndEnd),
    (SELECT max_date FROM StartAndEnd)
  )) AS day
GROUP BY day
ORDER BY day;

If you know the start and end dates in advance, you can hard-code them in the query and omit the StartAndEnd CTE:

WITH InstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
)
SELECT
  day,
  COUNTIF(is_install AND DATE(timestamp) = day) AS installs,
  COUNTIF(NOT is_install AND DATE(timestamp) = day) AS uninstalls
FROM (
  SELECT *, true AS is_install
  FROM InstallEvents UNION ALL
  SELECT *, false
  FROM UninstallEvents
)
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-01-04')) AS day
GROUP BY day
ORDER BY day;

To see the events in the sample data, use a query that unions the contents:

WITH InstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
)
SELECT timestamp, true AS is_install
FROM InstallEvents UNION ALL
SELECT timestamp, false
FROM UninstallEvents;
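The tagging idea above can be extended to the third table from the question. The following is only a sketch under stated assumptions: the CaseEvents CTE, the kind column, and all sample timestamps are hypothetical stand-ins for database.table3 and real data. It groups directly on the event date instead of cross joining with a generated date range, so days without any event in any table do not appear in the result.

```sql
#standardSQL
-- Dummy data standing in for database.table1/2/3 (hypothetical values).
WITH InstallEvents AS (
  SELECT TIMESTAMP '2017-01-01 10:00:00' AS timestamp UNION ALL
  SELECT TIMESTAMP '2017-01-02 11:00:00'
),
UninstallEvents AS (
  SELECT TIMESTAMP '2017-01-02 12:00:00' AS timestamp UNION ALL
  SELECT TIMESTAMP '2017-01-03 13:00:00'
),
CaseEvents AS (
  SELECT TIMESTAMP '2017-01-01 14:00:00' AS timestamp UNION ALL
  SELECT TIMESTAMP '2017-01-03 15:00:00'
)
SELECT
  DATE(timestamp) AS day,
  -- Each COUNTIF counts only the rows carrying the matching tag.
  COUNTIF(kind = 'install') AS installs,
  COUNTIF(kind = 'uninstall') AS uninstalls,
  COUNTIF(kind = 'case') AS cases
FROM (
  -- Tag every row with the table it came from before aggregating.
  SELECT timestamp, 'install' AS kind FROM InstallEvents UNION ALL
  SELECT timestamp, 'uninstall' FROM UninstallEvents UNION ALL
  SELECT timestamp, 'case' FROM CaseEvents
)
GROUP BY day
ORDER BY day;
```

A day that has events in only some of the tables still gets a row, with 0 in the columns for the other tables. If days with no events at all must also be listed, the GENERATE_DATE_ARRAY approach shown earlier in this answer is still needed.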

Answer 1: (score: 1)

Below is for BigQuery Standard SQL

   
#standardSQL
WITH calendar AS (
  SELECT day
  FROM (
    SELECT MIN(min_day) AS min_day, MAX(max_day) AS max_day
    FROM (
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table1` UNION ALL
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table2` UNION ALL
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table3`
    )
  ), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) AS day
)
SELECT 
  c.day AS day, 
  IFNULL(SUM(installs), 0) AS installs,
  IFNULL(SUM(uninstalls), 0) AS uninstalls,
  IFNULL(SUM(cases),0) AS cases  
FROM calendar AS c
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) installs   FROM `database.table1` GROUP BY day) t1 ON t1.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) uninstalls FROM `database.table2` GROUP BY day) t2 ON t2.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) cases      FROM `database.table3` GROUP BY day) t3 ON t3.day = c.day
GROUP BY day
HAVING installs + uninstalls + cases > 0
-- ORDER BY day  

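Please note: you are using timestamp as a column name, which is not best practice because it is a keyword. In my example I keep your naming, but please consider changing it!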

You can test / play with this solution using the dummy data below:
#standardSQL
WITH `database.table1` AS (
  SELECT TIMESTAMP '2017-01-01' AS timestamp, 1 AS installs  
  UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 
),
`database.table2` AS (
  SELECT TIMESTAMP '2016-12-01' AS timestamp, 1 AS installs  UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 UNION ALL
  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 
),
`database.table3` AS (
  SELECT TIMESTAMP '2017-01-01' AS timestamp, 1 AS installs  UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 UNION ALL
  SELECT TIMESTAMP '2017-01-10', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 
),
calendar AS (
  SELECT day
  FROM (
    SELECT MIN(min_day) AS min_day, MAX(max_day) AS max_day
    FROM (
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table1` UNION ALL
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table2` UNION ALL
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table3`
    )
  ), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) AS day
)
SELECT 
  c.day AS day, 
  IFNULL(SUM(installs), 0) AS installs,
  IFNULL(SUM(uninstalls), 0) AS uninstalls,
  IFNULL(SUM(cases),0) AS cases  
FROM calendar AS c
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) installs   FROM `database.table1` GROUP BY day) t1 ON t1.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) uninstalls FROM `database.table2` GROUP BY day) t2 ON t2.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) cases      FROM `database.table3` GROUP BY day) t3 ON t3.day = c.day
GROUP BY day
HAVING installs + uninstalls + cases > 0
ORDER BY day
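
On the note above about timestamp as a column name, here is a minimal sketch of what a renamed column could look like. The name event_time and the sample rows are hypothetical; the real schema may call for something else. If renaming is not an option, quoting the identifier in backticks is another way to make clear that a column, not the type, is meant.

```sql
#standardSQL
-- Hypothetical dummy table whose time column is named event_time
-- instead of timestamp.
WITH `database.table1` AS (
  SELECT TIMESTAMP '2017-01-01 08:00:00' AS event_time UNION ALL
  SELECT TIMESTAMP '2017-01-01 09:00:00' UNION ALL
  SELECT TIMESTAMP '2017-01-02 10:00:00'
)
SELECT DATE(event_time) AS day, COUNT(*) AS installs
FROM `database.table1`
GROUP BY day
ORDER BY day;
```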

Answer 2: (score: 0)

I'm not very familiar with BigQuery, so this may not be a copy-paste answer.

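You first have to build a calendar table to make sure you have all the dates. Here's an example for SQL Server. There are probably examples for BigQuery as well. The following assumes a table named Calander with a Date attribute of type timestamp.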
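
In BigQuery, such a calendar can be generated on the fly rather than stored. The following is a rough sketch, not a tested drop-in: it keeps the Calander and Date names assumed in this answer, and the January 2017 bounds are placeholders to be replaced with the range you actually need.

```sql
#standardSQL
WITH Calander AS (
  -- One row per day; TIMESTAMP(d) yields midnight (UTC by default) of each date.
  SELECT TIMESTAMP(d) AS Date
  FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2017-01-01', DATE '2017-01-31')) AS d
)
SELECT FORMAT_TIMESTAMP("%F", C.Date) AS day
FROM Calander C
ORDER BY day;
```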

Once you have the calendar table, you can join all your tables to it:

SELECT      FORMAT_TIMESTAMP("%F", C.Date) AS day
,           IFNULL(T1.installs, 0) AS installs
,           IFNULL(T2.uninstalls, 0) AS uninstalls
FROM        Calander C --Assumed to hold exactly one row per day
--Count per day first, so that joining several tables does not multiply rows
LEFT JOIN   (SELECT DATE(timestamp) AS event_day, COUNT(*) AS installs
             FROM database.table1 GROUP BY event_day) T1
        ON  T1.event_day = DATE(C.Date) --Convert to date to remove times
LEFT JOIN   (SELECT DATE(timestamp) AS event_day, COUNT(*) AS uninstalls
             FROM database.table2 GROUP BY event_day) T2
        ON  T2.event_day = DATE(C.Date)
ORDER BY    day ASC