Trending sum over time

Time: 2013-07-22 16:21:17

Tags: sql postgresql aggregate-functions window-functions

I have a table (in Postgres 9.1) that looks like this:

CREATE TABLE actions (
  user_id INTEGER,
  date    DATE,
  action  VARCHAR(255),
  count   INTEGER
);

For example:

    user_id    |    date    |     action   | count
---------------+------------+--------------+-------
             1 | 2013-01-01 | Email        |     1
             1 | 2013-01-02 | Call         |     3
             1 | 2013-01-03 | Email        |     3
             1 | 2013-01-04 | Call         |     2
             1 | 2013-01-04 | Voicemail    |     2
             1 | 2013-01-04 | Email        |     2
             2 | 2013-01-04 | Email        |     2

I would like to see a user's running total over time for a specific set of actions; for example, Call + Email:

  user_id  | date        |  count  
-----------+-------------+---------
         1 | 2013-01-01  |       1
         1 | 2013-01-02  |       4
         1 | 2013-01-03  |       7
         1 | 2013-01-04  |      11
         2 | 2013-01-04  |       2

The monstrosity I have created so far looks like this:

SELECT
  date, user_id, SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
FROM
  actions
WHERE
  action IN ('Call', 'Email') 
GROUP BY
  user_id, date, count;

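This works for single actions, but it seems to break when several actions occur on the same day. For example, instead of the expected 11 on 2013-01-04, we get 9: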

    date    |      user_id | count
------------+--------------+-------
 2013-01-01 | 1            |     1
 2013-01-02 | 1            |     4
 2013-01-03 | 1            |     7
 2013-01-04 | 1            |     9 <-- should be 11?
 2013-01-04 | 2            |     2

Is it possible to tweak my query to resolve this issue? I tried removing the grouping on count, but Postgres does not seem to like that:

column "actions.count" must appear in the GROUP BY clause
or be used in an aggregate function
LINE 2:      date, user_id, SUM(count) OVER (PARTITION BY user...
                                ^

3 answers:

Answer 0 (score: 1)

The table has a column named "count", and the expression in the SELECT clause is also aliased as "count", so the name is ambiguous.

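Read the documentation: http://www.postgresql.org/docs/9.0/static/sql-select.html#SQL-GROUPBY

  "In case of ambiguity, a GROUP BY name will be interpreted as an input-column name rather than an output column name."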

This means your query does not group by the "count" computed in the SELECT clause, but by the "count" values from the table.

This query gives the expected result, see SQL Fiddle:

SELECT date, user_id, count
FROM (
  SELECT date, user_id,
         SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
  FROM actions
  WHERE action IN ('Call', 'Email')
) alias
GROUP BY
  user_id, date, count;

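Answer 1 (score: 1)

Points to clarify

It is unclear whether you want to sort by user_id or by date.

It is also unclear whether the result should include dates that have no rows in the base table. If so, refer to this closely related answer:
PostgreSQL: running count of rows for a query 'by minute'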

Fix the names

First, I use this test table instead of the problematic one from the question:

CREATE TEMP TABLE actions (
  user_id integer,
  thedate    date,
  action  text,
  ct   integer
);

Your use of reserved words and function names as identifiers (column names) is part of the problem.

Fix the query

Combining aggregate and window functions

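Since aggregate functions are applied first, your original query collapses the two rows found for user_id = 1 and thedate = '2013-01-04' into one. You have to multiply by count(*) to get the actual running count.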
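To make the collapse described above visible, here is a minimal sketch that runs the same WHERE and GROUP BY and adds count(*) per group. It uses Python's sqlite3 module purely as a convenient stand-in for Postgres (an assumption of this example, not part of the answer); the table and column names follow the test table above, and the rows are the sample data from the question.

```python
import sqlite3

# Assumption: SQLite stands in for Postgres. This query is a plain GROUP BY,
# so no window-function support is needed for it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actions (user_id INTEGER, thedate TEXT, action TEXT, ct INTEGER)"
)
conn.executemany(
    "INSERT INTO actions VALUES (?, ?, ?, ?)",
    [
        (1, "2013-01-01", "Email", 1),
        (1, "2013-01-02", "Call", 3),
        (1, "2013-01-03", "Email", 3),
        (1, "2013-01-04", "Call", 2),
        (1, "2013-01-04", "Voicemail", 2),
        (1, "2013-01-04", "Email", 2),
        (2, "2013-01-04", "Email", 2),
    ],
)

# Same filter and grouping as the original query, plus the number of
# base rows that ended up in each group.
grouped = conn.execute(
    """
    SELECT user_id, thedate, ct, count(*) AS n
    FROM   actions
    WHERE  action IN ('Call', 'Email')
    GROUP  BY user_id, thedate, ct
    ORDER  BY user_id, thedate
    """
).fetchall()

for row in grouped:
    print(row)
```

The group for user 1 on 2013-01-04 comes back with n = 2: the Call and Email rows both have ct = 2, so they merge into one group. A window sum over the grouped rows then adds 2 only once (7 + 2 = 9) instead of twice (7 + 2 + 2 = 11), which is why ct has to be multiplied by count(*).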

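You can do this without a subquery, because aggregate functions and window functions can be combined. Aggregate functions are applied first, so you can even use a window function over the result of an aggregate function: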

SELECT thedate
     , user_id
     , sum(ct * count(*)) OVER (PARTITION BY user_id
                                ORDER BY thedate) AS running_ct
FROM   actions
WHERE  action IN ('Call', 'Email') 
GROUP  BY user_id, thedate, ct
ORDER  BY user_id, thedate;

Or simplified to:

...
 , sum(sum(ct)) OVER (PARTITION BY user_id
                      ORDER BY thedate) AS running_ct
...

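This should also be the fastest of the proposed solutions.

Here, the inner sum() is an aggregate function and the outer sum() is a window function, applied to the result of the aggregate function.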
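For reference, a complete version of the nested sum(sum(ct)) form could look like the sketch below. The parts elided with "..." above are filled in here as an assumption of this example: it groups by user_id and thedate only. The query is run through Python's sqlite3 module as a stand-in for Postgres, which requires SQLite 3.25 or later for window functions; the rows are the sample data from the question.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for Postgres for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actions (user_id INTEGER, thedate TEXT, action TEXT, ct INTEGER)"
)
conn.executemany(
    "INSERT INTO actions VALUES (?, ?, ?, ?)",
    [
        (1, "2013-01-01", "Email", 1),
        (1, "2013-01-02", "Call", 3),
        (1, "2013-01-03", "Email", 3),
        (1, "2013-01-04", "Call", 2),
        (1, "2013-01-04", "Voicemail", 2),
        (1, "2013-01-04", "Email", 2),
        (2, "2013-01-04", "Email", 2),
    ],
)

# Inner sum(ct): aggregate, one total per (user_id, thedate) group.
# Outer sum(...) OVER: window function, running total of those daily totals.
running = conn.execute(
    """
    SELECT thedate
         , user_id
         , sum(sum(ct)) OVER (PARTITION BY user_id
                              ORDER BY thedate) AS running_ct
    FROM   actions
    WHERE  action IN ('Call', 'Email')
    GROUP  BY user_id, thedate
    ORDER  BY user_id, thedate
    """
).fetchall()

for row in running:
    print(row)
```

With the sample data, user 1 reaches 11 on 2013-01-04 and user 2 has a single row with 2, matching the result requested in the question.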

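Or use DISTINCT

Another way is to use DISTINCT or DISTINCT ON, because DISTINCT is applied after window functions. This works because running_ct is guaranteed to be the same for all rows of a given user and day: the default frame definition of window functions sums all peers at once.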

SELECT DISTINCT
       thedate
     , user_id
     , sum(ct) OVER (PARTITION BY user_id ORDER BY thedate) AS running_ct
FROM   actions
WHERE  action IN ('Call', 'Email')
ORDER  BY thedate, user_id;

Or simplified with DISTINCT ON:

SELECT DISTINCT ON (thedate, user_id)
...

->SQLfiddle demonstrating all variants.

Answer 2 (score: 1)

This query produces the result you are looking for:

SELECT DISTINCT   
  date, user_id, SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count 
  FROM actions
WHERE
  action IN ('Call', 'Email');

The default window is already what you want, according to the official docs, and DISTINCT eliminates the duplicate rows when an Email and a Call occur on the same day.

See SQL Fiddle
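As a small illustration of why the default window works here, the sketch below compares the default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which includes all peers of the current row) with an explicit ROWS frame. It runs through Python's sqlite3 module as a stand-in for Postgres (an assumption of this example; SQLite 3.25 or later is required), and it uses the renamed columns thedate and ct rather than date and count.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for Postgres for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actions (user_id INTEGER, thedate TEXT, action TEXT, ct INTEGER)"
)
conn.executemany(
    "INSERT INTO actions VALUES (?, ?, ?, ?)",
    [
        (1, "2013-01-01", "Email", 1),
        (1, "2013-01-02", "Call", 3),
        (1, "2013-01-03", "Email", 3),
        (1, "2013-01-04", "Call", 2),
        (1, "2013-01-04", "Voicemail", 2),
        (1, "2013-01-04", "Email", 2),
        (2, "2013-01-04", "Email", 2),
    ],
)

# default_frame: rows with the same thedate are peers and are summed together.
# rows_frame: each physical row only sees the rows up to and including itself.
rows = conn.execute(
    """
    SELECT thedate
         , sum(ct) OVER (ORDER BY thedate) AS default_frame
         , sum(ct) OVER (ORDER BY thedate
                         ROWS BETWEEN UNBOUNDED PRECEDING
                                  AND CURRENT ROW) AS rows_frame
    FROM   actions
    WHERE  user_id = 1
    AND    action IN ('Call', 'Email')
    ORDER  BY thedate
    """
).fetchall()

default_totals = [r[1] for r in rows]
# The order of peers within a day is not defined, so sort before comparing.
rows_totals = sorted(r[2] for r in rows)

print(default_totals)
print(rows_totals)
```

Under the default frame both 2013-01-04 rows carry the same total, so DISTINCT can safely fold them into one row. With the ROWS frame the two rows would carry different totals (9 and 11), and DISTINCT would keep both.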