我很难弄清楚如何创建一个查询,该查询可以判断是否有7天没有任何活动(secondsPlayed == 0)在任何用户输入之前,如果是,则以1表示,否则为0。
这还意味着,如果用户的条目少于7个,则所有条目的值均为0。
Input table: +------------------------------+-------------------------+---------------+ | userid | estimationDate | secondsPlayed | +------------------------------+-------------------------+---------------+ | a | 2016-07-14 00:00:00 UTC | 192.5 | | a | 2016-07-15 00:00:00 UTC | 357.3 | | a | 2016-07-16 00:00:00 UTC | 0 | | a | 2016-07-17 00:00:00 UTC | 0 | | a | 2016-07-18 00:00:00 UTC | 0 | | a | 2016-07-19 00:00:00 UTC | 0 | | a | 2016-07-20 00:00:00 UTC | 0 | | a | 2016-07-21 00:00:00 UTC | 0 | | a | 2016-07-22 00:00:00 UTC | 0 | | a | 2016-07-23 00:00:00 UTC | 0 | | a | 2016-07-24 00:00:00 UTC | 0 | | ---------------------------- | ---------------------- | ---- | | b | 2016-07-02 00:00:00 UTC | 31.2 | | b | 2016-07-03 00:00:00 UTC | 42.1 | | b | 2016-07-04 00:00:00 UTC | 41.9 | | b | 2016-07-05 00:00:00 UTC | 43.2 | | b | 2016-07-06 00:00:00 UTC | 91.5 | | b | 2016-07-07 00:00:00 UTC | 0 | | b | 2016-07-08 00:00:00 UTC | 0 | | b | 2016-07-09 00:00:00 UTC | 239.1 | | b | 2016-07-10 00:00:00 UTC | 0 | +------------------------------+-------------------------+---------------+
预期的输出表应如下所示:
Output table: +------------------------------+-------------------------+---------------+----------+ | userid | estimationDate | secondsPlayed | inactive | +------------------------------+-------------------------+---------------+----------+ | a | 2016-07-14 00:00:00 UTC | 192.5 | 0 | | a | 2016-07-15 00:00:00 UTC | 357.3 | 0 | | a | 2016-07-16 00:00:00 UTC | 0 | 0 | | a | 2016-07-17 00:00:00 UTC | 0 | 0 | | a | 2016-07-18 00:00:00 UTC | 0 | 0 | | a | 2016-07-19 00:00:00 UTC | 0 | 0 | | a | 2016-07-20 00:00:00 UTC | 0 | 0 | | a | 2016-07-21 00:00:00 UTC | 0 | 0 | | a | 2016-07-22 00:00:00 UTC | 0 | 1 | | a | 2016-07-23 00:00:00 UTC | 0 | 1 | | a | 2016-07-24 00:00:00 UTC | 0 | 1 | | ---------------------------- | ----------------------- | ----- | ----- | | b | 2016-07-02 00:00:00 UTC | 31.2 | 0 | | b | 2016-07-03 00:00:00 UTC | 42.1 | 0 | | b | 2016-07-04 00:00:00 UTC | 41.9 | 0 | | b | 2016-07-05 00:00:00 UTC | 43.2 | 0 | | b | 2016-07-06 00:00:00 UTC | 91.5 | 0 | | b | 2016-07-07 00:00:00 UTC | 0 | 0 | | b | 2016-07-08 00:00:00 UTC | 0 | 0 | | b | 2016-07-09 00:00:00 UTC | 239.1 | 0 | | b | 2016-07-10 00:00:00 UTC | 0 | 0 | +------------------------------+-------------------------+---------------+----------+
起初,我正在考虑使用具有7个偏移量的Lag函数,但这显然与两者之间的任何主题都不相关。
我还在考虑创建7天的滚动窗口/平均值,并评估该平均值是否大于0。但是,这可能会比我的技能水平高一点。
任何人都可以很好地解决这个问题。
答案 0 :(得分:3)
假设您每天都有数据(这似乎是一个合理的假设),则可以对窗口函数求和:
select t.*,
(case when sum(secondsplayed) over (partition by userid
order by estimationdate
rows between 6 preceding and current row
) = 0 and
row_number() over (partition by userid order by estimationdate) >= 7
then 1
else 0
end) as inactive
from t;
除了日期中没有空洞之外,这还假设secondsplayed
永远不会为负。 (负值可以很容易地合并到逻辑中,但这似乎是不必要的。)
答案 1 :(得分:2)
根据我的经验,这种类型的输入表不包含不活动条目,通常看起来像这样(这里仅存在活动条目)
Input table: +------------------------------+-------------------------+---------------+ | userid | estimationDate | secondsPlayed | +------------------------------+-------------------------+---------------+ | a | 2016-07-14 00:00:00 UTC | 192.5 | | a | 2016-07-15 00:00:00 UTC | 357.3 | | ---------------------------- | ---------------------- | ---- | | b | 2016-07-02 00:00:00 UTC | 31.2 | | b | 2016-07-03 00:00:00 UTC | 42.1 | | b | 2016-07-04 00:00:00 UTC | 41.9 | | b | 2016-07-05 00:00:00 UTC | 43.2 | | b | 2016-07-06 00:00:00 UTC | 91.5 | | b | 2016-07-09 00:00:00 UTC | 239.1 | +------------------------------+-------------------------+---------------+
因此,以下是BigQuery标准SQL的输入,如上
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'a' userid, TIMESTAMP '2016-07-14 00:00:00 UTC' estimationDate, 192.5 secondsPlayed UNION ALL
SELECT 'a', '2016-07-15 00:00:00 UTC', 357.3 UNION ALL
SELECT 'b', '2016-07-02 00:00:00 UTC', 31.2 UNION ALL
SELECT 'b', '2016-07-03 00:00:00 UTC', 42.1 UNION ALL
SELECT 'b', '2016-07-04 00:00:00 UTC', 41.9 UNION ALL
SELECT 'b', '2016-07-05 00:00:00 UTC', 43.2 UNION ALL
SELECT 'b', '2016-07-06 00:00:00 UTC', 91.5 UNION ALL
SELECT 'b', '2016-07-09 00:00:00 UTC', 239.1
), time_frame AS (
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2016-07-02', '2016-07-24')) day
)
SELECT
users.userid,
day,
IFNULL(secondsPlayed, 0) secondsPlayed,
CAST(1 - SIGN(SUM(IFNULL(secondsPlayed, 0))
OVER(
PARTITION BY users.userid
ORDER BY UNIX_DATE(day)
RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
)) AS INT64) AS inactive
FROM time_frame tf
CROSS JOIN (SELECT DISTINCT userid FROM `project.dataset.table`) users
LEFT JOIN `project.dataset.table` t
ON day = DATE(estimationDate) AND users.userid = t.userid
ORDER BY userid, day
有结果
Row userid day secondsPlayed inactive ... 13 a 2016-07-14 192.5 0 14 a 2016-07-15 357.3 0 15 a 2016-07-15 357.3 0 16 a 2016-07-16 0.0 0 17 a 2016-07-17 0.0 0 18 a 2016-07-18 0.0 0 19 a 2016-07-19 0.0 0 20 a 2016-07-20 0.0 0 21 a 2016-07-21 0.0 0 22 a 2016-07-22 0.0 1 23 a 2016-07-23 0.0 1 24 a 2016-07-24 0.0 1 25 b 2016-07-02 31.2 0 26 b 2016-07-03 42.1 0 27 b 2016-07-04 41.9 0 28 b 2016-07-05 43.2 0 29 b 2016-07-06 91.5 0 30 b 2016-07-07 0.0 0 31 b 2016-07-08 0.0 0 32 b 2016-07-09 239.1 0 33 b 2016-07-10 0.0 0 ...