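I have a table that is just a list of dates and user IDs (not aggregated).
We define a metric called active users for a given date by counting the distinct IDs that appear in the previous 45 days.
I am trying to run a query in BigQuery that, for each day, returns the day plus the number of active users for that day (the distinct users counted from 45 days ago up to that day).
I have experimented with window functions, but I can't figure out how to define a range based on the date values in a column. Instead, I believe the following query would work in a database like MySQL, but it does not work in BigQuery.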
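To pin down the metric before looking at any SQL, here is a minimal brute-force sketch in Python. It is only a reference for checking query results on small data, not something BigQuery runs. The function name `active_users_by_day` and the sample rows are hypothetical, and the window is assumed to be inclusive on both ends (from 45 days before the day up to the day itself).

```python
from datetime import date, timedelta


def active_users_by_day(rows, window_days=45):
    """Count distinct user IDs seen in [day - window_days, day] for each day.

    rows: iterable of (day, user_id) pairs, unaggregated, duplicates allowed.
    """
    rows = list(rows)
    result = {}
    for day in sorted({d for d, _ in rows}):
        start = day - timedelta(days=window_days)
        # Inclusive on both ends, like BETWEEN day - 45 AND day.
        result[day] = len({uid for d, uid in rows if start <= d <= day})
    return result


# Hypothetical sample: user "b" is seen only on 2016-01-01.
sample = [
    (date(2016, 1, 1), "a"),
    (date(2016, 1, 1), "b"),
    (date(2016, 1, 1), "a"),   # duplicate row, counted once
    (date(2016, 2, 15), "a"),  # exactly 45 days after 2016-01-01
    (date(2016, 2, 16), "a"),  # 46 days after 2016-01-01
]

for day, count in active_users_by_day(sample).items():
    print(day, count)
# 2016-01-01 2
# 2016-01-15 is not printed: only days present in the table are reported
# 2016-02-15 2
# 2016-02-16 1
```

User "b" still counts on 2016-02-15 because the window includes the day exactly 45 days earlier, and drops out one day later.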
SELECT
  day,
  (SELECT
     COUNT(DISTINCT visid)
   FROM daily_users
   WHERE day BETWEEN DATE_ADD(t.day, -45, "DAY") AND t.day
  ) AS active_users
FROM daily_users AS t
GROUP BY 1
This doesn't work in BigQuery: "Subselect not allowed in SELECT clause."
How can I do this in BigQuery?
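Answer 0 (score: 1)
The BigQuery documentation claims that count(distinct) can be used as a window function. However, that doesn't help you, because you are not looking for a traditional window frame.
One method is to add a record for each date after a visit, so that every visit is counted toward each day whose window contains it: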
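select theday, count(distinct visid)
from (select date_add(u.day, n.n, "day") as theday, u.visid
      from daily_users u cross join
           -- n starts at 0 so a visit also counts toward its own day,
           -- matching BETWEEN day - 45 AND day in the question
           (select 0 as n union all select 1 union all . . .
            select 45
           ) n
     ) u
group by theday;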
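Note: there may be a simpler way to generate this series of integers in BigQuery. In standard SQL, UNNEST(GENERATE_ARRAY(start, end)) is one option.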
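The fan-out idea above can be sketched outside SQL as follows. This is a minimal Python illustration, not BigQuery code: the function name `active_users_fan_out` and the sample rows are hypothetical, and it assumes offsets run from 0 to the window length so that a visit counts toward its own day.

```python
from collections import defaultdict
from datetime import date, timedelta


def active_users_fan_out(rows, window_days=45):
    """Copy each visit onto every date whose window contains it, then count."""
    ids_by_day = defaultdict(set)
    for day, uid in rows:
        # Offsets 0..window_days: the visit day itself plus the days after it.
        for offset in range(window_days + 1):
            ids_by_day[day + timedelta(days=offset)].add(uid)
    # One distinct count per generated date, like GROUP BY theday.
    return {day: len(ids) for day, ids in sorted(ids_by_day.items())}


# Hypothetical sample with a 1-day window to keep the output short.
rows = [
    (date(2016, 1, 10), 1),
    (date(2016, 1, 10), 2),
    (date(2016, 1, 11), 3),
]
counts = active_users_fan_out(rows, window_days=1)
for day, count in counts.items():
    print(day, count)
# 2016-01-10 2
# 2016-01-11 3
# 2016-01-12 1
```

Note that this approach also yields dates that have no rows of their own in the table (2016-01-12 here), and it multiplies the row count by the number of offsets before aggregating.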
Answer 1 (score: 1)
The query below should work with BigQuery:
#legacySQL
SELECT day, active_users FROM (
  SELECT
    day,
    -- 45*24*3600 seconds = 45 days, measured on the ts column
    COUNT(DISTINCT id)
      OVER (ORDER BY ts RANGE BETWEEN 45*24*3600 PRECEDING AND CURRENT ROW) AS active_users
  FROM (
    -- convert day to seconds so it can be used in a RANGE frame
    SELECT day, id, TIMESTAMP_TO_SEC(TIMESTAMP(day)) AS ts
    FROM daily_users
  )
) GROUP BY 1, 2 ORDER BY 1
The above assumes that the day field is represented in '2016-01-10' format.
If that is not the case, adjust TIMESTAMP_TO_SEC(TIMESTAMP(day)) in the innermost SELECT.
See also the COUNT(DISTINCT) specifics in BigQuery.
Update for BigQuery Standard SQL
#standardSQL
SELECT
  day,
  -- count the distinct ids collected in the window array
  (SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
  SELECT
    day,
    -- 3888000 seconds = 45 * 24 * 3600 = 45 days
    ARRAY_AGG(id)
      OVER (ORDER BY ts RANGE BETWEEN 3888000 PRECEDING AND CURRENT ROW) AS active_users
  FROM (
    SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
    FROM daily_users
  )
)
GROUP BY 1, 2
ORDER BY 1
You can test / play with it using the dummy sample below:
#standardSQL
WITH daily_users AS (
  SELECT 1 AS id, '2016-01-10' AS day UNION ALL
  SELECT 2 AS id, '2016-01-10' AS day UNION ALL
  SELECT 1 AS id, '2016-01-11' AS day UNION ALL
  SELECT 3 AS id, '2016-01-11' AS day UNION ALL
  SELECT 1 AS id, '2016-01-12' AS day UNION ALL
  SELECT 1 AS id, '2016-01-12' AS day UNION ALL
  SELECT 1 AS id, '2016-01-12' AS day UNION ALL
  SELECT 1 AS id, '2016-01-13' AS day
)
SELECT
  day,
  (SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
  SELECT
    day,
    -- 86400 seconds = 1 day: a shorter window than 45 days, which suits this four-day sample
    ARRAY_AGG(id)
      OVER (ORDER BY ts RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW) AS active_users
  FROM (
    SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
    FROM daily_users
  )
)
GROUP BY 1, 2
ORDER BY 1
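To sanity-check what the sample query should return, the same steps can be mirrored in plain Python. This is a rough sketch of the logic only (collect the ids in a RANGE-style window per row, count the distinct ones, then deduplicate the (day, count) pairs). The helper names `to_ts` and `rolling_active_users` are hypothetical, and it assumes the window is inclusive on both ends and includes all rows sharing the current row's timestamp.

```python
from datetime import date

SECONDS_PER_DAY = 24 * 3600

# Same rows as the dummy sample: (id, day).
daily_users = [
    (1, "2016-01-10"),
    (2, "2016-01-10"),
    (1, "2016-01-11"),
    (3, "2016-01-11"),
    (1, "2016-01-12"),
    (1, "2016-01-12"),
    (1, "2016-01-12"),
    (1, "2016-01-13"),
]


def to_ts(day):
    # Days since 1970-01-01 times seconds per day,
    # mirroring UNIX_DATE(PARSE_DATE(...)) * 24 * 3600.
    return (date.fromisoformat(day) - date(1970, 1, 1)).days * SECONDS_PER_DAY


def rolling_active_users(rows, window_seconds):
    with_ts = [(day, uid, to_ts(day)) for uid, day in rows]
    result = set()
    for day, _, ts in with_ts:
        # Ids of every row whose timestamp falls in [ts - window, ts].
        window_ids = [uid for _, uid, other in with_ts
                      if ts - window_seconds <= other <= ts]
        # Distinct count per row; the set removes repeated (day, count) pairs.
        result.add((day, len(set(window_ids))))
    return sorted(result)


for day, active in rolling_active_users(daily_users, SECONDS_PER_DAY):
    print(day, active)
# 2016-01-10 2
# 2016-01-11 3
# 2016-01-12 2
# 2016-01-13 1
```

Passing `45 * SECONDS_PER_DAY` instead of `SECONDS_PER_DAY` corresponds to the 3888000-second window of the 45-day query.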