I have the following query, which fetches the latest N observations for each station:
SELECT id
FROM (
SELECT station_id, id, created_at,
row_number() OVER(PARTITION BY station_id
ORDER BY created_at DESC) AS rn
FROM (
SELECT station_id, id, created_at
FROM observations
) s
) s
WHERE rn <= #{n}
ORDER BY station_id, created_at DESC;
I have indexes on id and station_id.
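This is the only solution I have come up with that returns multiple records per station. But it is slow (154.0 ms for a table of 81,000 records).
How can I speed up the query?
Answer 0 (score: 5)
Assuming at least Postgres 9.3.
First, a multicolumn index will help: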
CREATE INDEX observations_special_idx
ON observations(station_id, created_at DESC, id)
created_at DESC is a slightly better fit, but the index would still be scanned backwards at almost the same speed without DESC.
This assumes created_at is defined NOT NULL. Otherwise, consider DESC NULLS LAST in both the index and the query.
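As a rough sketch of the NULLS LAST variant, assuming created_at can be NULL: the index name observations_nulls_last_idx and the literal 5 (standing in for N) are placeholders, not part of the original answer. The sort order in the index and in both ORDER BY clauses has to match for the index to be usable.

```sql
-- Sketch only: relevant if created_at is nullable.
-- The index name is a hypothetical placeholder.
CREATE INDEX observations_nulls_last_idx
    ON observations (station_id, created_at DESC NULLS LAST, id);

SELECT id
FROM  (
   SELECT station_id, id, created_at
        , row_number() OVER (PARTITION BY station_id
                             ORDER BY created_at DESC NULLS LAST) AS rn
   FROM   observations
   ) s
WHERE  rn <= 5   -- example value for N
ORDER  BY station_id, created_at DESC NULLS LAST;
```

Without NULLS LAST, a descending sort in Postgres puts NULL values first, so rows with no timestamp would be counted as the "latest" ones.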
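The last column, id, is only useful if you get an index-only scan out of it, which is unlikely if you constantly add many new rows (the visibility map would rarely be up to date). In that case, remove id from the index.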
Simplify your query; the inner subselect serves no purpose:
SELECT id
FROM (
SELECT station_id, id, created_at
, row_number() OVER (PARTITION BY station_id
ORDER BY created_at DESC) AS rn
FROM observations
) s
WHERE rn <= #{n} -- your limit here
ORDER BY station_id, created_at DESC;
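This should be a bit faster, but it is still slow.
The following assumes station_id is defined NOT NULL. To be really fast, you need the equivalent of a loose index scan (not yet implemented in Postgres).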
If you have a separate stations table (which seems likely), you can emulate this with JOIN LATERAL (Postgres 9.3+):
SELECT o.id
FROM stations s
CROSS JOIN LATERAL (
SELECT o.id, o.created_at -- created_at is needed by the outer ORDER BY
FROM observations o
WHERE o.station_id = s.station_id -- lateral reference
ORDER BY o.created_at DESC
LIMIT #{n} -- your limit here
) o
ORDER BY s.station_id, o.created_at DESC;
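If you don't have a stations table, the next best thing is to create and maintain one. Consider adding a foreign key reference to enforce relational integrity.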
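One way to set up such a table might look like the sketch below. It assumes observations.station_id is an integer; the constraint name is a hypothetical placeholder, and the stations table is kept to a single column for illustration.

```sql
-- Sketch only: the column type is an assumption, adjust it to match
-- the actual type of observations.station_id.
CREATE TABLE stations (
   station_id integer PRIMARY KEY
);

-- Seed the table once from existing data.
INSERT INTO stations (station_id)
SELECT DISTINCT station_id
FROM   observations
WHERE  station_id IS NOT NULL;

-- Enforce referential integrity from now on (hypothetical constraint name).
ALTER TABLE observations
   ADD CONSTRAINT observations_station_id_fkey
   FOREIGN KEY (station_id) REFERENCES stations (station_id);
```

With the foreign key in place, a new station has to be inserted into stations before its first observation can be stored.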
If that's not an option, you can distill such a table on the fly. Simple options would be:
SELECT DISTINCT station_id FROM observations;
SELECT station_id FROM observations GROUP BY 1;
But either one needs a sequential scan and is slow. To make Postgres use the above index (or any btree index with station_id as the leading column), use a recursive CTE:
WITH RECURSIVE stations AS (
( -- extra pair of parentheses ...
SELECT station_id
FROM observations
ORDER BY station_id
LIMIT 1
) -- ... is required!
UNION ALL
SELECT (SELECT o.station_id
FROM observations o
WHERE o.station_id > s.station_id
ORDER BY o.station_id
LIMIT 1)
FROM stations s
WHERE s.station_id IS NOT NULL -- serves as break condition
)
SELECT station_id
FROM stations
WHERE station_id IS NOT NULL; -- remove dangling row with NULL
Use that as a drop-in replacement for the stations table in the simple query above:
WITH RECURSIVE stations AS (
(
SELECT station_id
FROM observations
ORDER BY station_id
LIMIT 1
)
UNION ALL
SELECT (SELECT o.station_id
FROM observations o
WHERE o.station_id > s.station_id
ORDER BY o.station_id
LIMIT 1)
FROM stations s
WHERE s.station_id IS NOT NULL
)
SELECT o.id
FROM stations s
CROSS JOIN LATERAL (
SELECT o.id, o.created_at
FROM observations o
WHERE o.station_id = s.station_id
ORDER BY o.created_at DESC
LIMIT #{n} -- your limit here
) o
WHERE s.station_id IS NOT NULL
ORDER BY s.station_id, o.created_at DESC;
This should still be faster than what you had by orders of magnitude.
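To check that claim on your own data, you could compare the plans and timings with EXPLAIN. The sketch below wraps the LATERAL variant and assumes a stations table exists; the literal 5 is an example value for N.

```sql
-- Sketch only: run the same wrapper around the window-function query
-- and compare the reported execution times.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id
FROM   stations s
CROSS  JOIN LATERAL (
   SELECT o.id, o.created_at
   FROM   observations o
   WHERE  o.station_id = s.station_id
   ORDER  BY o.created_at DESC
   LIMIT  5   -- example value for N
   ) o
ORDER  BY s.station_id, o.created_at DESC;
```

In the output, look for an index scan on observations inside the nested loop rather than a sequential scan over the whole table.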
Answer 1 (score: 1)
This is a good answer only if you don't need to query up-to-date, live data.
Preparation (requires PostgreSQL 9.3):
drop materialized view if exists test;
create materialized view test as select * from (
SELECT station_id, id, created_at,
row_number() OVER(
PARTITION BY station_id
ORDER BY created_at DESC
) as rn
FROM (
SELECT
station_id,
id,
created_at
FROM observations
) s
) q WHERE q.rn <= 100 -- use a value that will be your max limit number for further queries
ORDER BY station_id, rn DESC ;
create index idx_test on test(station_id,rn,created_at);
How to query the data:
select * from test where rn<10 order by station_id,created_at;
On my machine, the original query took 281 ms, while this new query takes 15 ms.
How to update the view with fresh data:
refresh materialized view test;
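A plain refresh locks the view against reads while it runs. If that is a problem and you are on PostgreSQL 9.4 or later (newer than the 9.3 this answer targets), a concurrent refresh is one possible alternative. It requires a unique index on the view; the index name below is a hypothetical placeholder.

```sql
-- Sketch only: needs PostgreSQL 9.4+.
-- (station_id, rn) is unique because rn is a row_number() per station.
CREATE UNIQUE INDEX idx_test_unique ON test (station_id, rn);

-- Readers can keep querying the old contents while this runs.
REFRESH MATERIALIZED VIEW CONCURRENTLY test;
```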
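I have another solution that doesn't need a materialized view and works with live, up-to-date data. But given that you don't need up-to-date data, this materialized view is much more efficient.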