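N-day retention in BigQuery, error message: Invalid time zone

Time: 2018-11-19 21:52:07

Tags: sql datetime timezone google-bigquery standard-sql

I am trying to calculate N-day retention for a dataset in Google BigQuery. The table contains one month of data from a mobile application, and I want to find the number of users who return each day. I am using standardSQL. The code I have so far is: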

SELECT date(d1.eventDate) as dt,
        COUNT(distinct d1.userID) as total_users,
        COUNT(distinct d2.userID) as retained_users
         FROM `dataset` as d1
        LEFT JOIN `dataset` as d2 ON 
        d1.userID = d2.userID
        AND date(d1.eventDate) = date(datetime(d2.eventDate, '-1 day'))
          GROUP BY 1
          ORDER BY 1"

When I try to execute it, I get the error message:

  Error: Invalid time zone: -1 day [invalidQuery]

My table structure is:

eventDate               | UserID |
2016-05-06 00:00:00 UTC | 100000 |
2016-05-06 00:00:00 UTC | 200000 |
2016-05-06 00:00:00 UTC | 300000 |

What should I use instead of '-1 day'?

2 answers:

Answer 0: (score: 1)

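TIMESTAMP_SUB fixes the query as written, although for performance reasons it may not be enough of a solution on its own. At the very least, it lets you shift a timestamp by 1 day: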

SELECT date(d1.created_at) as dt,
        COUNT(distinct d1.actor.id) as total_users,
        COUNT(distinct d2.actor.id) as retained_users
         FROM `githubarchive.month.201810` as d1
        LEFT JOIN `githubarchive.month.201810` as d2 ON 
        d1.actor.id = d2.actor.id
        AND date(d1.created_at) = date(TIMESTAMP_SUB(d2.created_at, INTERVAL -24 HOUR))
          GROUP BY 1
          ORDER BY 1
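
As background on the error itself: in Standard SQL, the second argument of DATETIME(timestamp_expression, time_zone) is a time zone name, which is why '-1 day' is rejected as an invalid time zone. Below is a minimal sketch of date arithmetic that avoids the problem. The inline `sample` rows are made up for illustration and stand in for the real table:

```sql
#standardSQL
-- Hypothetical inline rows; replace `sample` with your own table.
WITH sample AS (
  SELECT TIMESTAMP '2016-05-06 00:00:00 UTC' AS eventDate, 100000 AS userID
  UNION ALL
  SELECT TIMESTAMP '2016-05-07 00:00:00 UTC', 100000
)
SELECT
  eventDate,
  DATE(eventDate) AS event_day,
  -- Subtract one calendar day from the DATE value.
  DATE_SUB(DATE(eventDate), INTERVAL 1 DAY) AS previous_day,
  -- Subtract 24 hours from the TIMESTAMP, then truncate to a DATE.
  DATE(TIMESTAMP_SUB(eventDate, INTERVAL 24 HOUR)) AS previous_day_via_timestamp
FROM sample
```

Note that TIMESTAMP_SUB with a negative interval, as in the query above, moves the timestamp forward rather than back, so `d2` there is matched against the day before `d1`.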

To improve performance, deduplicate the rows before the JOIN:

SELECT day as dt,
    COUNT(distinct d1.id) as total_users,
    COUNT(distinct d2.id) as retained_users
FROM (SELECT DISTINCT actor.id, DATE(created_at) day FROM `githubarchive.month.201810`)as d1
LEFT JOIN (SELECT DISTINCT actor.id,  DATE(TIMESTAMP_SUB(created_at, INTERVAL -24 HOUR)) day FROM `githubarchive.month.201810`) as d2 
USING (id, day)
GROUP BY 1
ORDER BY 1


Answer 1: (score: 0)

The following is for BigQuery Standard SQL and is further optimized to use an analytic function instead of any JOIN:

#standardSQL
SELECT
  day, 
  COUNT(1) total_users,
  COUNTIF(delta = 1) retained_users
FROM (
  SELECT
    day, id, 
    DATE_DIFF(day, LAG(day) OVER(PARTITION BY id ORDER BY day), DAY) delta
  FROM (
    SELECT DISTINCT
      DATE(created_at) day,
      actor.id
    FROM `githubarchive.month.201810`
  )
)
GROUP BY day
ORDER BY day   

Or, using the notation from the original question:

#standardSQL
SELECT
  day, 
  COUNT(1) total_users,
  COUNTIF(delta = 1) retained_users
FROM (
  SELECT
    day, userID, 
    DATE_DIFF(day, LAG(day) OVER(PARTITION BY userID ORDER BY day), DAY) delta
  FROM (
    SELECT DISTINCT
      DATE(eventDate) day,
      userID
    FROM `project.dataset.table`
  )
)
GROUP BY day
ORDER BY day
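
To see how `delta` behaves before it is aggregated, here is a small sketch of the inner step on inline rows. The `visits` CTE and its user IDs and dates are invented for illustration and are assumed to be already deduplicated to one row per user per day:

```sql
#standardSQL
-- Hypothetical, already-deduplicated (day, userID) pairs.
WITH visits AS (
  SELECT DATE '2016-05-06' AS day, 100000 AS userID UNION ALL
  SELECT DATE '2016-05-07', 100000 UNION ALL
  SELECT DATE '2016-05-09', 100000 UNION ALL
  SELECT DATE '2016-05-06', 200000
)
SELECT
  day,
  userID,
  -- Previous active day for the same user, or NULL on the first visit.
  LAG(day) OVER (PARTITION BY userID ORDER BY day) AS previous_visit,
  -- Number of days since that previous active day.
  DATE_DIFF(day, LAG(day) OVER (PARTITION BY userID ORDER BY day), DAY) AS delta
FROM visits
ORDER BY userID, day
```

A user's first active day has a NULL `delta`, consecutive days give 1, and a gap such as 2016-05-07 to 2016-05-09 gives 2. COUNTIF(delta = 1) therefore counts, for each day, only the users who were also active on the previous day.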