PostgreSQL: aggregating over a recursive window

Date: 2017-03-22 09:40:42

Tags: sql postgresql

I have a series of Events, generated by different users over time.

How can I aggregate this series by events that are close to each other? Two events a and b belong to the same window if:

    b.user = a.user
and b.time >= a.time
and b.time - a.time <= interval '1 month'

This condition applies recursively. For example, the following dataset:

CREATE TABLE pg_temp.Data
    ("event" int, "user" int, "date" date, "value" int)
;

INSERT INTO pg_temp.Data
    ("event", "user", "date", "value")
VALUES
    (1, 1, '2017-01-01', 5),
    (2, 1, '2017-01-07', 3),
    (3, 1, '2017-02-09', 2),
    (4, 1, '2017-03-12', 4),
    (5, 1, '2017-04-03', 7),
    (6, 1, '2017-05-01', 6),
    (7, 2, '2017-01-05', 9),
    (8, 2, '2017-01-12', 1),
    (9, 2, '2017-03-24', 6)
;


select * from pg_temp.Data

should be reduced to:

[
    {
        "init": "2017-01-01",
        "latest": "2017-01-07",
        "events": [
            1,
            2
        ],
        "user": 1,
        "value": 8
    },
    {
        "init": "2017-02-09",
        "latest": "2017-02-09",
        "events": [
            3
        ],
        "user": 1,
        "value": 2
    },
    {
        "init": "2017-03-12",
        "latest": "2017-05-01",
        "events": [
            4,
            5,
            6
        ],
        "user": 1,
        "value": 17
    },
    {
        "init": "2017-01-05",
        "latest": "2017-01-12",
        "events": [
            7,
            8
        ],
        "user": 2,
        "value": 10
    },
    {
        "init": "2017-03-24",
        "latest": "2017-03-24",
        "events": [
            9
        ],
        "user": 2,
        "value": 6
    }
]

where init and latest are the time bounds of the window, and value is the sum of the values in the window.

Note that events 6 and 4 are more than one month apart, but because event 5 falls between them, they are aggregated into the same group.
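The chaining rule can be sketched outside the database to check the expected output. The following is a minimal Python sketch, not part of the original question: the helper names add_one_month and group_events are hypothetical, and it assumes that "within one month" means "no later than the previous event's date plus one calendar month", with the day clamped to the end of shorter months.

```python
import calendar
from datetime import date


def add_one_month(d):
    """Add one calendar month, clamping the day to the end of the month (assumed semantics)."""
    year, month = (d.year + 1, 1) if d.month == 12 else (d.year, d.month + 1)
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def group_events(rows):
    """Group (event, user, date, value) tuples into per-user chained windows."""
    windows = []
    open_window = {}  # user -> the window currently being extended
    for event, user, day, value in sorted(rows, key=lambda r: (r[1], r[2])):
        current = open_window.get(user)
        # Start a new window when the gap to the previous event exceeds one month.
        if current is None or day > add_one_month(current["latest"]):
            current = {"init": day, "latest": day, "events": [], "user": user, "value": 0}
            windows.append(current)
            open_window[user] = current
        current["latest"] = day
        current["events"].append(event)
        current["value"] += value
    return windows


# Same rows as the INSERT statement in the question.
DATA = [
    (1, 1, date(2017, 1, 1), 5),
    (2, 1, date(2017, 1, 7), 3),
    (3, 1, date(2017, 2, 9), 2),
    (4, 1, date(2017, 3, 12), 4),
    (5, 1, date(2017, 4, 3), 7),
    (6, 1, date(2017, 5, 1), 6),
    (7, 2, date(2017, 1, 5), 9),
    (8, 2, date(2017, 1, 12), 1),
    (9, 2, date(2017, 3, 24), 6),
]

windows = group_events(DATA)
for w in windows:
    print(w["user"], w["init"], w["latest"], w["events"], w["value"])
```

Each event is compared only with the latest event of the open window, which is why event 6 joins the window started by event 4 even though those two are more than a month apart.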

1 answer:

Answer 0 (score: 3):

Use window functions:

SELECT min(date) AS init,
       max(date) AS latest,
       array_agg(event) AS events,
       "user",
       sum(value) AS value
FROM (SELECT event,
             "user",
             date,
             value,
             -- running count of group starts = session number within each user
             count(grp_start)
                OVER (PARTITION BY "user" ORDER BY date) session_id
      FROM (SELECT event,
                   "user",
                   date,
                   value,
                   -- 1 if this event starts a new group, otherwise NULL
                   CASE
                      WHEN date
                         > lag(date, 1, timestamp '-infinity')
                              OVER (PARTITION BY "user" ORDER BY date)
                           + INTERVAL '1 month'
                      THEN 1
                   END grp_start
            FROM data
           ) tagged
     ) numbered
GROUP BY "user", session_id
ORDER BY "user", init;

This results in:

┌─────────────────────┬─────────────────────┬─────────┬──────┬───────┐
│        init         │       latest        │ events  │ user │ value │
├─────────────────────┼─────────────────────┼─────────┼──────┼───────┤
│ 2017-01-01 00:00:00 │ 2017-01-07 00:00:00 │ {1,2}   │    1 │     8 │
│ 2017-02-09 00:00:00 │ 2017-02-09 00:00:00 │ {3}     │    1 │     2 │
│ 2017-03-12 00:00:00 │ 2017-05-01 00:00:00 │ {4,5,6} │    1 │    17 │
│ 2017-01-05 00:00:00 │ 2017-01-12 00:00:00 │ {7,8}   │    2 │    10 │
│ 2017-03-24 00:00:00 │ 2017-03-24 00:00:00 │ {9}     │    2 │     6 │
└─────────────────────┴─────────────────────┴─────────┴──────┴───────┘
(5 rows)
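The query works in two steps: the innermost subquery flags every event that starts a new group, and the middle subquery turns a running count of those flags into a session number. The sketch below replays both steps in Python for user 1 only. It is an illustration rather than a translation of the SQL: add_one_month is a hypothetical helper that assumes calendar-month addition with the day clamped to the end of the month, and None stands in for SQL NULL.

```python
import calendar
from datetime import date
from itertools import accumulate


def add_one_month(d):
    """Add one calendar month, clamping the day to the end of the month (assumed semantics)."""
    year, month = (d.year + 1, 1) if d.month == 12 else (d.year, d.month + 1)
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


# Dates of user 1's events (events 1 to 6), already ordered by date.
dates = [
    date(2017, 1, 1),
    date(2017, 1, 7),
    date(2017, 2, 9),
    date(2017, 3, 12),
    date(2017, 4, 3),
    date(2017, 5, 1),
]

# Step 1, the CASE expression: 1 when there is no previous event or the gap
# to it exceeds one month, otherwise None (NULL in the query).
grp_start = [
    1 if prev is None or cur > add_one_month(prev) else None
    for prev, cur in zip([None] + dates[:-1], dates)
]

# Step 2, count(grp_start) OVER (... ORDER BY date): a running count of the
# non-NULL flags, which stays constant inside a group.
session_id = list(accumulate(0 if flag is None else 1 for flag in grp_start))

print(grp_start)
print(session_id)
```

Events with the same session number form one group, so the final GROUP BY "user", session_id only has to aggregate each group.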

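A word of advice: it is not a good idea to use reserved words such as "user" as column names. If you forget to put them in double quotes, surprising things happen (try it).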