7-day user count: Big-Query self-join to get date range and count?

Time: 2019-06-17 11:57:11

Tags: sql google-bigquery

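My Google Firebase event data is integrated into BigQuery, and I'm trying to fetch from there one of the pieces of information Firebase gives me automatically: the 1-day, 7-day and 28-day user counts.

The 1-day count is quite straightforward: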

SELECT
  "1-day" as period,
  events.event_date,
  count(distinct events.user_pseudo_id) as uid
FROM
  `your_path.events_*` as events
WHERE events.event_name = "session_start"
group by events.event_date

with a neat result:

period   event_date  uid
1-day    20190609    5
1-day    20190610    7
1-day    20190611    5
1-day    20190612    7
1-day    20190613    37
1-day    20190614    73
1-day    20190615    52
1-day    20190616    36

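But it gets complicated for me when I try to count, for each day, how many distinct users there were in the previous 7 days. By taking the query above, filtering on 7 days and removing the GROUP BY clause, I know my target value for day 20190616 is 142.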
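For reference, the check described above could look roughly like the query below. This is only a sketch: the window bounds (20190610 through 20190616, inclusive) are an assumption about which 7 days were filtered, and the comparison relies on event_date being a YYYYMMDD string, which sorts in date order.

```sql
#standardSQL
-- Sketch: one distinct-user count over a fixed 7-day range, with no GROUP BY.
-- The range 20190610..20190616 is an assumed window ending on the target day.
SELECT
  "7-day" AS period,
  COUNT(DISTINCT events.user_pseudo_id) AS uid
FROM
  `your_path.events_*` AS events
WHERE events.event_name = "session_start"
  AND events.event_date BETWEEN "20190610" AND "20190616"
```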

The solution I tried is a straight self-join (plus variations that didn't change the result):

SELECT
  "7-day" as period,
  events.event_date,
  count(distinct user_events.user_pseudo_id) as uid
FROM
  `your_path.events_*` as events,
  `your_path.events_*` as user_events
WHERE user_events.event_name = "session_start"
  and PARSE_DATE("%Y%m%d", events.event_date) between DATE_SUB(PARSE_DATE("%Y%m%d", user_events.event_date), INTERVAL 7 DAY) and PARSE_DATE("%Y%m%d", user_events.event_date) #one day in the first table should correspond to 7 days worth of events in the second
  and events.event_date = "20190616" #fixed date to check
group by events.event_date

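Now, I know I'm hardly specifying any join condition, but if anything I expected that to produce a cross join and a huge result. Instead, the count this way is 70, which is a lot lower than expected. Furthermore, I can set INTERVAL 2 DAY and the result does not change.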
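One plausible explanation, offered as a sketch rather than a verified diagnosis: the BETWEEN condition points the wrong way. It keeps rows where the date in events lies between user_events.event_date minus 7 days and user_events.event_date, so user_events can only fall on the fixed date or up to 7 days after it. With 20190616 near the end of the data, only a day or two can match, whatever the interval. The variant below flips the condition so that user_events covers the fixed day and the 6 days before it. The 6-day offset (7 days inclusive) and the alias anchor are my assumptions.

```sql
#standardSQL
-- Sketch: 7-day distinct users for one fixed anchor date.
-- Window = anchor day plus the 6 preceding days (assumed definition).
SELECT
  "7-day" AS period,
  anchor.event_date,
  COUNT(DISTINCT user_events.user_pseudo_id) AS uid
FROM
  (SELECT DISTINCT event_date FROM `your_path.events_*`) AS anchor
CROSS JOIN
  `your_path.events_*` AS user_events
WHERE user_events.event_name = "session_start"
  -- user_events must fall inside the window that ends on the anchor date
  AND PARSE_DATE("%Y%m%d", user_events.event_date)
    BETWEEN DATE_SUB(PARSE_DATE("%Y%m%d", anchor.event_date), INTERVAL 6 DAY)
        AND PARSE_DATE("%Y%m%d", anchor.event_date)
  AND anchor.event_date = "20190616"  -- fixed date to check
GROUP BY anchor.event_date
```

Selecting the distinct dates first keeps the left side of the cross join to one row per day instead of one row per event.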

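I'm clearly doing something wrong here, but I also think my approach is rudimentary and there must be a smarter way to accomplish this.

I have checked "Calculating a current day 7 day active user with BigQuery?", but the explicit cross join there is with event_dim, and I'm not sure about that definition.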


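Following a suggestion in the comments, I tried the solution provided in "Rolling 90 days active users in BigQuery, improving preformance (DAU/MAU/WAU)". It looked good at first, but it has a problem with the most recent days. Here is the query using COUNT(DISTINCT), adapted to my case: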

SELECT DATE_SUB(event_date, INTERVAL i DAY) date_grp
 , COUNT(DISTINCT user_pseudo_id) unique_90_day_users
 , COUNT(DISTINCT IF(i<29,user_pseudo_id,null)) unique_28_day_users
 , COUNT(DISTINCT IF(i<8,user_pseudo_id,null)) unique_7_day_users
 , COUNT(DISTINCT IF(i<2,user_pseudo_id,null)) unique_1_day_users
FROM (
  SELECT PARSE_DATE("%Y%m%d",event_date) as event_date, user_pseudo_id
  FROM `your_path_here.events_*`
  WHERE EXTRACT(YEAR FROM PARSE_DATE("%Y%m%d",event_date))=2019
  GROUP BY 1, 2
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp

Here are the results for the last few days (bearing in mind that the data starts on May 23); you can see that they are wrong:

row_num   date_grp     90-day  28-day  7-day   1-day
114       2019-06-16   273     273     273     210
115       2019-06-17   78      78      78      78

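So for the last days, the 90-day, 28-day and 7-day counts only consider that same day instead of all the days before it. The 90-day count for June 17 cannot be 78 when the 1-day count for June 16 alone is higher.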
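A possible reading of this result: DATE_SUB moves each event back by i days, so a given date_grp collects users from the days after it rather than the days before it, and the most recent dates have few or no later days to draw on. A sketch of a trailing-window variant follows, assuming an event on day E should count toward date_grp E through E+89. The switch to DATE_ADD, the 0-based offsets and the HAVING filter are my assumptions, not part of the linked answer.

```sql
#standardSQL
-- Sketch: trailing windows. An event on day E contributes to date_grp E + i
-- for i = 0..89, so date_grp D sees users active on D-89 through D.
SELECT DATE_ADD(event_date, INTERVAL i DAY) AS date_grp
 , COUNT(DISTINCT user_pseudo_id) AS unique_90_day_users
 , COUNT(DISTINCT IF(i < 28, user_pseudo_id, NULL)) AS unique_28_day_users
 , COUNT(DISTINCT IF(i < 7, user_pseudo_id, NULL)) AS unique_7_day_users
 , COUNT(DISTINCT IF(i < 1, user_pseudo_id, NULL)) AS unique_1_day_users
FROM (
  -- one row per (day, user) pair
  SELECT PARSE_DATE("%Y%m%d", event_date) AS event_date, user_pseudo_id
  FROM `your_path_here.events_*`
  WHERE EXTRACT(YEAR FROM PARSE_DATE("%Y%m%d", event_date)) = 2019
  GROUP BY 1, 2
), UNNEST(GENERATE_ARRAY(0, 89)) AS i
GROUP BY 1
HAVING date_grp <= CURRENT_DATE()  -- drop shifted dates that lie in the future
ORDER BY date_grp
```

With this orientation the earliest dates are the ones with incomplete windows, which is what one would expect for data that starts on May 23.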

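1 answer:

Answer 0: (score: 0)

This is an answer to my own question. I'm not very familiar with BQ shortcuts and some of its advanced features, so my approach is basic, but the results are correct. I hope others can contribute a better query.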

#standardSQL
WITH dates AS (
  SELECT i as event_date
  FROM UNNEST(GENERATE_DATE_ARRAY('2019-05-24', CURRENT_DATE(), INTERVAL 1 DAY)) i
)
, ptd_dates as (
  SELECT DISTINCT "90-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
  FROM dates,
    UNNEST(GENERATE_ARRAY(1, 90)) i
  UNION ALL
  SELECT distinct "28-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
  FROM dates,
    UNNEST(GENERATE_ARRAY(1, 28)) i  -- 28 offsets (i-1 = 0..27), consistent with the 7-day and 90-day branches
  UNION ALL
  SELECT distinct "7-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
  FROM dates,
    UNNEST(GENERATE_ARRAY(1, 7)) i
  UNION ALL
  SELECT distinct "1-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",event_date) as ptd_date
  FROM dates
)


SELECT event_date,
  sum(IF(day_category="90-day",unique_ptd_users,null)) as count_90_day ,
  sum(IF(day_category="28-day",unique_ptd_users,null)) as count_28_day,
  sum(IF(day_category="7-day",unique_ptd_users,null)) as count_7_day,
  sum(IF(day_category="1-day",unique_ptd_users,null)) as count_1_day
from (
SELECT ptd_dates.day_category
  , ptd_dates.event_date
  , COUNT(DISTINCT user_pseudo_id) unique_ptd_users
FROM ptd_dates,
  `your_path_here.events_*` events,
  unnest(events.event_params) e_params
WHERE ptd_dates.ptd_date = events.event_date
GROUP BY ptd_dates.day_category
  , ptd_dates.event_date)
group by event_date
order by 1,2,3

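Following ECris's suggestion, I first defined a calendar table to use: it contains 4 categories of PTD (period to date). Each one is generated from basic elements, so it should scale linearly, and because it does not query the events dataset it has no gaps.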
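To make the calendar table concrete, here is a minimal sketch that expands a single day into its 7-day window, using the same i-1 offset pattern as ptd_dates. The anchor date 2019-06-16 is chosen purely for illustration.

```sql
#standardSQL
-- Sketch: list the dates that make up the 7-day window ending on one anchor date.
WITH dates AS (
  SELECT DATE '2019-06-16' AS event_date  -- illustrative anchor date (assumption)
)
SELECT
  "7-day" AS day_category,
  FORMAT_DATE("%Y%m%d", event_date) AS event_date,
  FORMAT_DATE("%Y%m%d", DATE_SUB(event_date, INTERVAL i - 1 DAY)) AS ptd_date
FROM dates,
  UNNEST(GENERATE_ARRAY(1, 7)) AS i
ORDER BY ptd_date
```

This returns seven rows, with ptd_date running from 20190610 to 20190616, all tagged with event_date 20190616. It reads no event data, so every calendar day is present whether or not any events occurred on it.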

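Then the calendar is joined with the events, and the join condition shows how, for each date, I count the distinct users across all the related days in the period.

The results are correct.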
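For readers who only need one window, the same calendar-join idea can be reduced to a shorter query. The sketch below is a simplification under my own assumptions, not the author's query: it handles the 7-day category only, uses the hypothetical CTE name window_dates, and leaves out the UNNEST of event_params, since no parameter is referenced when counting user_pseudo_id.

```sql
#standardSQL
-- Sketch: 7-day distinct users per day via a calendar of window dates.
WITH dates AS (
  SELECT d AS event_date
  FROM UNNEST(GENERATE_DATE_ARRAY('2019-05-24', CURRENT_DATE(), INTERVAL 1 DAY)) AS d
)
, window_dates AS (  -- hypothetical name: one row per (day, day-in-window) pair
  SELECT
    FORMAT_DATE("%Y%m%d", event_date) AS event_date,
    FORMAT_DATE("%Y%m%d", DATE_SUB(event_date, INTERVAL i DAY)) AS ptd_date
  FROM dates,
    UNNEST(GENERATE_ARRAY(0, 6)) AS i  -- offsets 0..6 = the day itself plus 6 before
)
SELECT
  window_dates.event_date,
  COUNT(DISTINCT events.user_pseudo_id) AS count_7_day
FROM window_dates
JOIN `your_path_here.events_*` AS events
  ON window_dates.ptd_date = events.event_date  -- match each window date to that day's events
GROUP BY window_dates.event_date
ORDER BY window_dates.event_date
```

Like the full query, this counts users across all event types. Adding a filter such as event_name = "session_start" would align it with the 1-day query in the question.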