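I'm trying to find a way to compare against the current row in the PARTITION BY clause of a window function in a PostgreSQL query.
Imagine I have the short list of 5 elements produced by the following query (in the real case I have thousands or even millions of rows). For each row, I'm trying to get the id of the next different element (event column) and the id of the previous different element.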
WITH events AS(
SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT lag(id) over w as previous_different, event
, lead(id) over w as next_different
FROM events ev
WINDOW w AS (PARTITION BY event!=ev.event ORDER BY date ASC);
I know the comparison event!=ev.event is incorrect, but it shows what I'm trying to achieve.
The result I get is (the same as if I remove the PARTITION BY clause):
|12|2
1|12|3
2|13|4
3|13|5
4|12|
The result I would like to get is:
|12|3
|12|3
2|13|5
2|13|5
4|12|
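Does anyone know whether this is possible, and how? Thank you very much!
EDIT: I know I can do it with two JOINs, an ORDER BY, and a DISTINCT ON, but with millions of rows in the real case that is very inefficient: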
WITH events AS(
SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT DISTINCT ON (e.id, e.date) e1.id, e.event, e2.id
FROM events e
LEFT JOIN events e1 ON (e1.date<=e.date AND e1.id!=e.id AND e1.event!=e.event)
LEFT JOIN events e2 ON (e2.date>=e.date AND e2.id!=e.id AND e2.event!=e.event)
ORDER BY e.date ASC, e.id ASC, e1.date DESC, e1.id DESC, e2.date ASC, e2.id ASC
Answer 0 (score: 9)
Using several different window functions and two subqueries, this should run decently fast:
WITH events(id, event, ts) AS (
VALUES
(1, 12, '2014-03-19 08:00:00'::timestamp)
,(2, 12, '2014-03-19 08:30:00')
,(3, 13, '2014-03-19 09:00:00')
,(4, 13, '2014-03-19 09:30:00')
,(5, 12, '2014-03-19 10:00:00')
)
SELECT first_value(pre_id) OVER (PARTITION BY grp ORDER BY ts) AS pre_id
, id, ts
, first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
FROM (
SELECT *, count(step) OVER w AS grp
FROM (
SELECT id, ts
, NULLIF(lag(event) OVER w, event) AS step
, lag(id) OVER w AS pre_id
, lead(id) OVER w AS post_id
FROM events
WINDOW w AS (ORDER BY ts)
) sub1
WINDOW w AS (ORDER BY ts)
) sub2
ORDER BY ts;
Using ts as the name for the timestamp column.
Assuming ts is unique - and indexed (a unique constraint does that automatically).
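If ts does not yet carry such a constraint, it could be added along these lines. This is only a sketch: it assumes the data lives in a regular table called events (the query above uses a CTE purely for demonstration), and the constraint name events_ts_key is made up for illustration.

```sql
-- Assumption: a real table named events with a timestamp column ts.
-- The constraint name events_ts_key is hypothetical.
ALTER TABLE events
  ADD CONSTRAINT events_ts_key UNIQUE (ts);  -- PostgreSQL backs this with a unique index on ts
```

The statement fails if ts currently contains duplicates, which is also a quick way to check the uniqueness assumption.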
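In a test with a real-life table of 50k rows, it needed only a single index scan. So it should be decently fast even with big tables. In comparison, your join / distinct query did not finish after a minute (as expected). Even an optimized version that deals with one cross join at a time (a left join with hardly any limiting condition is effectively a limited cross join) did not finish after a minute.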
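For best performance with a big table, tune your memory settings, work_mem in particular (for big sort operations). If you can spare the RAM, consider setting it (much) higher for your session temporarily. Read more here and here.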
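Raising work_mem for a single session might look like the sketch below. The value 256MB is an arbitrary example, not a recommendation; pick one that fits the RAM you can actually spare.

```sql
SHOW work_mem;            -- inspect the current setting
SET work_mem = '256MB';   -- affects the current session only; example value
-- ... run the big window-function query here ...
RESET work_mem;           -- return to the configured default
```

Inside a transaction, SET LOCAL work_mem = '256MB'; limits the change to that transaction instead of the whole session.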
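In subquery sub1, look at the event from the previous row and keep it only if it has changed, thus marking the first element of a new group. At the same time, get the id of the previous and the next row (pre_id, post_id).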
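To see what this first step produces, sub1 can be run on its own against the sample data. The sketch below is the inner subquery from the answer with the event column added to the select list for readability; the result shown in the comment was worked out by hand from the five sample rows.

```sql
WITH events(id, event, ts) AS (
   VALUES
      (1, 12, '2014-03-19 08:00:00'::timestamp)
    , (2, 12, '2014-03-19 08:30:00')
    , (3, 13, '2014-03-19 09:00:00')
    , (4, 13, '2014-03-19 09:30:00')
    , (5, 12, '2014-03-19 10:00:00')
   )
SELECT id, event
     , NULLIF(lag(event) OVER w, event) AS step  -- NULL unless the event changed
     , lag(id)  OVER w AS pre_id                 -- id of the previous row
     , lead(id) OVER w AS post_id                -- id of the next row
FROM   events
WINDOW w AS (ORDER BY ts)
ORDER  BY ts;
--  id | event | step | pre_id | post_id
--   1 |    12 |      |        |       2
--   2 |    12 |      |      1 |       3
--   3 |    13 |   12 |      2 |       4
--   4 |    13 |      |      3 |       5
--   5 |    12 |   13 |      4 |
```

Only rows 3 and 5 get a non-null step, because those are the rows where the event differs from the one before.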
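In subquery sub2, count() only counts non-null values, so the running count goes up by one at every row where the event changed. The resulting grp marks peers in blocks of consecutive identical events.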
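The grouping step can be checked in isolation as well. This sketch trims the two inner subqueries down to the columns needed for grp; the result in the comment was derived by hand from the sample rows.

```sql
WITH events(id, event, ts) AS (
   VALUES
      (1, 12, '2014-03-19 08:00:00'::timestamp)
    , (2, 12, '2014-03-19 08:30:00')
    , (3, 13, '2014-03-19 09:00:00')
    , (4, 13, '2014-03-19 09:30:00')
    , (5, 12, '2014-03-19 10:00:00')
   )
SELECT id, event, step
     , count(step) OVER (ORDER BY ts) AS grp  -- running count of non-null steps
FROM  (
   SELECT id, event, ts
        , NULLIF(lag(event) OVER (ORDER BY ts), event) AS step
   FROM   events
   ) sub1
ORDER  BY ts;
--  id | event | step | grp
--   1 |    12 |      |   0
--   2 |    12 |      |   0
--   3 |    13 |   12 |   1
--   4 |    13 |      |   1
--   5 |    12 |   13 |   2
```

Note that rows 1-2 and row 5 share event 12 but land in different groups (0 and 2), which is exactly what a plain PARTITION BY event could not express.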
In the final SELECT, take the first pre_id and the last post_id per group for each row to arrive at the desired result.
Actually, this should be even faster in the outer SELECT:
last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS post_id
... because the sort order of the window agrees with the window for pre_id, so only a single sort is needed. A quick test seems to confirm it. More about this frame definition.
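Put together, the query with the last_value() variant plugged in might look like the sketch below. It is the answer's query with only the post_id expression swapped and ts left out of the output list to keep the result narrow; the result in the comment was worked out by hand from the sample rows.

```sql
WITH events(id, event, ts) AS (
   VALUES
      (1, 12, '2014-03-19 08:00:00'::timestamp)
    , (2, 12, '2014-03-19 08:30:00')
    , (3, 13, '2014-03-19 09:00:00')
    , (4, 13, '2014-03-19 09:30:00')
    , (5, 12, '2014-03-19 10:00:00')
   )
SELECT first_value(pre_id) OVER (PARTITION BY grp ORDER BY ts) AS pre_id
     , id
       -- Same PARTITION BY / ORDER BY as above; the explicit frame makes
       -- last_value() see the whole partition, not just rows up to the current one.
     , last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
                                 RANGE BETWEEN UNBOUNDED PRECEDING
                                           AND UNBOUNDED FOLLOWING) AS post_id
FROM  (
   SELECT *, count(step) OVER w AS grp
   FROM  (
      SELECT id, ts
           , NULLIF(lag(event) OVER w, event) AS step
           , lag(id)  OVER w AS pre_id
           , lead(id) OVER w AS post_id
      FROM   events
      WINDOW w AS (ORDER BY ts)
      ) sub1
   WINDOW w AS (ORDER BY ts)
   ) sub2
ORDER  BY ts;
--  pre_id | id | post_id
--         |  1 |       3
--         |  2 |       3
--       2 |  3 |       5
--       2 |  4 |       5
--       4 |  5 |
```

Without the explicit frame, the default frame ends at the current row, so last_value() would simply return each row's own post_id.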