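SQL query to set a column based on the last seven entries

Date: 2018-09-11 13:40:48

Tags: sql google-bigquery

Question

I'm having a hard time figuring out how to write a query that tells whether there have been 7 days without any activity (secondsPlayed == 0) before each of a user's entries. If so, the row should get a value of 1, otherwise 0.

This also means that if a user has fewer than 7 entries, the value is 0 for all of them.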


Input table:
+------------------------------+-------------------------+---------------+
|            userid            |     estimationDate      | secondsPlayed |
+------------------------------+-------------------------+---------------+
| a                            | 2016-07-14 00:00:00 UTC | 192.5         |
| a                            | 2016-07-15 00:00:00 UTC | 357.3         |
| a                            | 2016-07-16 00:00:00 UTC | 0             |
| a                            | 2016-07-17 00:00:00 UTC | 0             |
| a                            | 2016-07-18 00:00:00 UTC | 0             |
| a                            | 2016-07-19 00:00:00 UTC | 0             |
| a                            | 2016-07-20 00:00:00 UTC | 0             |
| a                            | 2016-07-21 00:00:00 UTC | 0             |
| a                            | 2016-07-22 00:00:00 UTC | 0             |
| a                            | 2016-07-23 00:00:00 UTC | 0             |
| a                            | 2016-07-24 00:00:00 UTC | 0             |
| ---------------------------- | ----------------------  | ----          |
| b                            | 2016-07-02 00:00:00 UTC | 31.2          |
| b                            | 2016-07-03 00:00:00 UTC | 42.1          |
| b                            | 2016-07-04 00:00:00 UTC | 41.9          |
| b                            | 2016-07-05 00:00:00 UTC | 43.2          |
| b                            | 2016-07-06 00:00:00 UTC | 91.5          |
| b                            | 2016-07-07 00:00:00 UTC | 0             |
| b                            | 2016-07-08 00:00:00 UTC | 0             |
| b                            | 2016-07-09 00:00:00 UTC | 239.1         |
| b                            | 2016-07-10 00:00:00 UTC | 0             |
+------------------------------+-------------------------+---------------+

The expected output table should look like this:


Output table:

+------------------------------+-------------------------+---------------+----------+
|            userid            |     estimationDate      | secondsPlayed | inactive |
+------------------------------+-------------------------+---------------+----------+
| a                            | 2016-07-14 00:00:00 UTC | 192.5         | 0        |
| a                            | 2016-07-15 00:00:00 UTC | 357.3         | 0        |
| a                            | 2016-07-16 00:00:00 UTC | 0             | 0        |
| a                            | 2016-07-17 00:00:00 UTC | 0             | 0        |
| a                            | 2016-07-18 00:00:00 UTC | 0             | 0        |
| a                            | 2016-07-19 00:00:00 UTC | 0             | 0        |
| a                            | 2016-07-20 00:00:00 UTC | 0             | 0        |
| a                            | 2016-07-21 00:00:00 UTC | 0             | 0        |
| a                            | 2016-07-22 00:00:00 UTC | 0             | 1        |
| a                            | 2016-07-23 00:00:00 UTC | 0             | 1        |
| a                            | 2016-07-24 00:00:00 UTC | 0             | 1        |
| ---------------------------- | ----------------------- | -----         | -----    |
| b                            | 2016-07-02 00:00:00 UTC | 31.2          | 0        |
| b                            | 2016-07-03 00:00:00 UTC | 42.1          | 0        |
| b                            | 2016-07-04 00:00:00 UTC | 41.9          | 0        |
| b                            | 2016-07-05 00:00:00 UTC | 43.2          | 0        |
| b                            | 2016-07-06 00:00:00 UTC | 91.5          | 0        |
| b                            | 2016-07-07 00:00:00 UTC | 0             | 0        |
| b                            | 2016-07-08 00:00:00 UTC | 0             | 0        |
| b                            | 2016-07-09 00:00:00 UTC | 239.1         | 0        |
| b                            | 2016-07-10 00:00:00 UTC | 0             | 0        |
+------------------------------+-------------------------+---------------+----------+


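Thoughts

At first I considered using the LAG function with an offset of 7, but that only looks at a single earlier row and says nothing about the entries in between.

I was also thinking of creating a 7-day rolling window/average and checking whether that average is greater than 0. However, that may be a bit above my skill level.

Does anyone have a good solution to this problem?

2 Answers:

Answer 0 (score: 3)

Assuming you have data for every day (which seems a reasonable assumption), you can use a window-function sum: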

select t.*,
       (case when sum(secondsplayed) over (partition by userid
                                           order by estimationdate
                                           rows between 6 preceding and current row
                                          ) = 0 and
                  row_number() over (partition by userid order by estimationdate) >= 7
             then 1
             else 0
        end) as inactive                  
from t;

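Besides assuming there are no holes in the dates, this also assumes that secondsplayed is never negative. (Negative values could easily be incorporated into the logic, but that seems unnecessary.)

Answer 1 (score: 2)

In my experience, this type of input table does not contain inactivity entries and usually looks like the following (only activity entries are present):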
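If negative values could occur, one way to fold them into the logic is to count the non-zero rows in the window instead of summing them. The following is only a sketch under stated assumptions: it runs in SQLite through Python's sqlite3 module rather than in BigQuery (window functions need SQLite 3.25 or newer), the table name `t` and the extra user `c` are hypothetical, and `COUNT(*) OVER w = 7` stands in for the `row_number() >= 7` check. In BigQuery the same count could presumably be written with `COUNTIF(secondsplayed != 0)`.

```python
import sqlite3

# User 'a' is taken from the question; user 'c' is a hypothetical case whose
# +5 / -5 entries cancel out, so a plain SUM over the window would be 0.
rows = [("a", f"2016-07-{d:02d}", s)
        for d, s in zip(range(14, 25), [192.5, 357.3] + [0] * 9)]
rows += [("c", f"2016-07-{d:02d}", s)
         for d, s in zip(range(1, 8), [5, -5, 0, 0, 0, 0, 0])]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (userid TEXT, estimationdate TEXT, secondsplayed REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

query = """
SELECT userid, estimationdate,
       CASE WHEN SUM(secondsplayed <> 0) OVER w = 0  -- no non-zero row in the window
             AND COUNT(*) OVER w = 7                 -- window holds a full 7 rows
            THEN 1 ELSE 0 END AS inactive
FROM t
WINDOW w AS (PARTITION BY userid ORDER BY estimationdate
             ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
ORDER BY userid, estimationdate
"""
result = conn.execute(query).fetchall()

for userid, day, inactive in result:
    print(userid, day, inactive)
```

For user `a` this flags 2016-07-22 through 2016-07-24, matching the expected output in the question. User `c` is never flagged, because two of its seven rows are non-zero even though they sum to 0.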


Input table:
+------------------------------+-------------------------+---------------+
|            userid            |     estimationDate      | secondsPlayed |
+------------------------------+-------------------------+---------------+
| a                            | 2016-07-14 00:00:00 UTC | 192.5         |
| a                            | 2016-07-15 00:00:00 UTC | 357.3         |
| ---------------------------- | ----------------------  | ----          |
| b                            | 2016-07-02 00:00:00 UTC | 31.2          |
| b                            | 2016-07-03 00:00:00 UTC | 42.1          |
| b                            | 2016-07-04 00:00:00 UTC | 41.9          |
| b                            | 2016-07-05 00:00:00 UTC | 43.2          |
| b                            | 2016-07-06 00:00:00 UTC | 91.5          |
| b                            | 2016-07-09 00:00:00 UTC | 239.1         |
+------------------------------+-------------------------+---------------+

So, below is a query for BigQuery Standard SQL, using the input above:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'a' userid, TIMESTAMP '2016-07-14 00:00:00 UTC' estimationDate, 192.5 secondsPlayed UNION ALL
  SELECT 'a', '2016-07-15 00:00:00 UTC', 357.3 UNION ALL
  SELECT 'b', '2016-07-02 00:00:00 UTC', 31.2 UNION ALL
  SELECT 'b', '2016-07-03 00:00:00 UTC', 42.1 UNION ALL
  SELECT 'b', '2016-07-04 00:00:00 UTC', 41.9 UNION ALL
  SELECT 'b', '2016-07-05 00:00:00 UTC', 43.2 UNION ALL
  SELECT 'b', '2016-07-06 00:00:00 UTC', 91.5 UNION ALL
  SELECT 'b', '2016-07-09 00:00:00 UTC', 239.1 
), time_frame AS (
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY('2016-07-02', '2016-07-24')) day
)
SELECT 
  users.userid, 
  day, 
  IFNULL(secondsPlayed, 0) secondsPlayed,
  CAST(1 - SIGN(SUM(IFNULL(secondsPlayed, 0)) 
    OVER(
      PARTITION BY users.userid 
      ORDER BY UNIX_DATE(day)
      RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
    )) AS INT64) AS inactive 
FROM time_frame tf
CROSS JOIN (SELECT DISTINCT userid FROM `project.dataset.table`) users
LEFT JOIN `project.dataset.table` t
ON day = DATE(estimationDate) AND users.userid = t.userid
ORDER BY userid, day   

with the result:

  
Row userid  day         secondsPlayed   inactive     
...
13  a       2016-07-14  192.5           0    
14  a       2016-07-15  357.3           0    
15  a       2016-07-16  0.0             0    
16  a       2016-07-17  0.0             0    
17  a       2016-07-18  0.0             0    
18  a       2016-07-19  0.0             0    
19  a       2016-07-20  0.0             0    
20  a       2016-07-21  0.0             0    
21  a       2016-07-22  0.0             1    
22  a       2016-07-23  0.0             1    
23  a       2016-07-24  0.0             1    
24  b       2016-07-02  31.2            0    
25  b       2016-07-03  42.1            0    
26  b       2016-07-04  41.9            0    
27  b       2016-07-05  43.2            0    
28  b       2016-07-06  91.5            0    
29  b       2016-07-07  0.0             0    
30  b       2016-07-08  0.0             0    
31  b       2016-07-09  239.1           0    
32  b       2016-07-10  0.0             0    
...