Query last N related rows per row

Time: 2014-09-21 08:59:36

Tags: sql performance postgresql indexing query-optimization

I have the following query, which fetches the id of the latest N observations for each station:

SELECT id
FROM  (
  SELECT station_id, id, created_at
       , row_number() OVER (PARTITION BY station_id
                            ORDER BY created_at DESC) AS rn
  FROM  (
    SELECT station_id, id, created_at
    FROM   observations
    ) s
  ) s
WHERE  rn <= #{n}
ORDER  BY station_id, created_at DESC;

I have indexes on id and station_id.

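This is the only solution I have come up with that fetches more than one record per station. However, it is quite slow (154.0 ms for a table of 81,000 records).

How can I speed up the query?

2 answers:

Answer 0 (score: 5)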

Assuming at least Postgres 9.3.

Index

First, a multicolumn index will help:

CREATE INDEX observations_special_idx
ON observations(station_id, created_at DESC, id)

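created_at DESC is a slightly better fit, but the index would still be scanned backwards at almost the same speed without DESC.

This assumes created_at is defined NOT NULL. Otherwise, consider DESC NULLS LAST in both the index and the query.

The last column, id, is only useful if you get an index-only scan out of it, which probably won't work if you constantly add lots of new rows. In that case, remove id from the index.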
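
As a rough sketch of the two points above (assuming created_at may be nullable and the trailing id column does not pay off for your workload), the index could be declared without id and with NULLS LAST. The index name observations_station_created_idx, the station value 1 and the limit 5 are made up for illustration. EXPLAIN then shows whether the planner actually picks the index:

```sql
-- Hypothetical variant of the index above: no trailing id column,
-- and NULLS LAST in case created_at can be NULL.
CREATE INDEX observations_station_created_idx
ON observations (station_id, created_at DESC NULLS LAST);

-- The query must use the same sort order to match the index.
-- Look for "Index Scan" or "Index Only Scan" in the plan output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM   observations
WHERE  station_id = 1                  -- assumed example station (integer id)
ORDER  BY created_at DESC NULLS LAST
LIMIT  5;                              -- assumed example limit
```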

Simpler query (still slow)

Simplify your query; the inner subselect does nothing useful:

SELECT id
FROM  (
  SELECT station_id, id, created_at
       , row_number() OVER (PARTITION BY station_id
                            ORDER BY created_at DESC) AS rn
  FROM   observations
  ) s
WHERE  rn <= #{n}  -- your limit here
ORDER  BY station_id, created_at DESC;

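Should be a bit faster, but still slow.

Fast query

  • Assuming you have relatively few stations and relatively many observations per station.
  • Also assuming station_id is defined as NOT NULL.

To be really fast, you need the equivalent of a loose index scan (not implemented in Postgres yet).

If you have a separate stations table (which seems likely), you can emulate this with JOIN LATERAL (Postgres 9.3+):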

SELECT o.id
FROM   stations s
CROSS  JOIN LATERAL (
   SELECT o.id, o.created_at  -- created_at is needed by the outer ORDER BY
   FROM   observations o
   WHERE  o.station_id = s.station_id  -- lateral reference
   ORDER  BY o.created_at DESC
   LIMIT  #{n}  -- your limit here
   ) o
ORDER  BY s.station_id, o.created_at DESC;

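If you don't have a stations table, the next best thing is to create and maintain one. Possibly add a foreign key reference to enforce relational integrity.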
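
A minimal sketch of creating and filling such a table from the existing data, assuming station_id is an integer (adjust the type to match observations.station_id). The constraint name is made up for illustration:

```sql
-- Assumed column type; match it to observations.station_id.
CREATE TABLE stations (
   station_id integer PRIMARY KEY
);

-- One-time backfill from the existing observations.
INSERT INTO stations (station_id)
SELECT DISTINCT station_id
FROM   observations
WHERE  station_id IS NOT NULL;

-- Enforce referential integrity from now on.
ALTER TABLE observations
   ADD CONSTRAINT observations_station_id_fkey
   FOREIGN KEY (station_id) REFERENCES stations (station_id);
```

With the foreign key in place, a new station has to be inserted into stations before its first observation can be stored.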

If that is not an option, you can distill such a table on the fly. Simple options would be:

SELECT DISTINCT station_id FROM observations;
SELECT station_id FROM observations GROUP BY 1;

But either would need a sequential scan and be slow. Make Postgres use the above index (or any btree index with station_id as leading column) with a recursive CTE:

WITH RECURSIVE stations AS (
   (                  -- extra pair of parentheses ...
   SELECT station_id
   FROM   observations
   ORDER  BY station_id
   LIMIT  1
   )                  -- ... is required!
   UNION ALL
   SELECT (SELECT o.station_id
           FROM   observations o
           WHERE  o.station_id > s.station_id
           ORDER  BY o.station_id
           LIMIT  1)
   FROM   stations s
   WHERE  s.station_id IS NOT NULL  -- serves as break condition
   )
SELECT station_id
FROM   stations
WHERE  station_id IS NOT NULL;      -- remove dangling row with NULL

Use that as a drop-in replacement for the stations table in the simple query above:

WITH RECURSIVE stations AS (
   (
   SELECT station_id
   FROM   observations
   ORDER  BY station_id
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT o.station_id
           FROM   observations o
           WHERE  o.station_id > s.station_id
           ORDER  BY o.station_id
           LIMIT  1)
   FROM   stations s
   WHERE  s.station_id IS NOT NULL
   )
SELECT o.id
FROM   stations s
CROSS  JOIN LATERAL (
   SELECT o.id, o.created_at
   FROM   observations o
   WHERE  o.station_id = s.station_id
   ORDER  BY o.created_at DESC
   LIMIT  #{n}  -- your limit here
   ) o
WHERE  s.station_id IS NOT NULL
ORDER  BY s.station_id, o.created_at DESC;

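This should still be faster than the original query by orders of magnitude.

SQL Fiddle here (9.6)
db<>fiddle here

Answer 1 (score: 1)

This is a good answer only if you do not need to query up-to-date, live data.

Preparation (requires PostgreSQL 9.3)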

drop materialized view if exists test;  -- if exists: no error on the first run
create materialized view test as select * from (
  SELECT station_id, id, created_at,
      row_number() OVER(
          PARTITION BY station_id
          ORDER BY created_at DESC
      ) as rn
  FROM (
      SELECT
          station_id,
          id,
          created_at
      FROM observations
  ) s
 ) q WHERE q.rn <= 100 -- use a value that will be your max limit number for further queries
ORDER BY station_id, rn DESC ;


create index idx_test on test(station_id,rn,created_at);

How to query the data:

select * from test where rn <= 10 order by station_id, created_at;  -- latest 10 rows per station

On my machine the original query takes 281 ms, while this new query takes 15 ms.

How to update the view with the latest data:

refresh materialized view test;
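
Note that a plain refresh locks the view against reads while it runs. As a sketch, assuming PostgreSQL 9.4 or later (newer than the 9.3 stated above), the refresh can run without blocking readers, provided the view has a unique index. The index name idx_test_unique is made up for illustration; (station_id, rn) is unique here because row_number() numbers the rows within each station:

```sql
-- Required once: CONCURRENTLY needs a unique index on the materialized view.
CREATE UNIQUE INDEX idx_test_unique ON test (station_id, rn);

-- Rebuilds the contents while readers keep seeing the old data (9.4+).
REFRESH MATERIALIZED VIEW CONCURRENTLY test;
```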

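I have another solution that does not require a materialized view and works with live, up-to-date data. But given that you do not need up-to-date data, this materialized view is much more efficient.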