Question

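I have a table, pings, with about 15 million rows. I'm on Postgres 9.2.4. The relevant columns are a foreign key monitor_id, a created_at timestamp, and response_time, an integer representing milliseconds. Here is the exact structure: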

     Column      |            Type             |                     Modifiers                      
-----------------+-----------------------------+----------------------------------------------------
 id              | integer                     | not null default nextval('pings_id_seq'::regclass)
 url             | character varying(255)      | 
 monitor_id      | integer                     | 
 response_status | integer                     | 
 response_time   | integer                     | 
 created_at      | timestamp without time zone | 
 updated_at      | timestamp without time zone | 
 response_body   | text                        | 
Indexes:
    "pings_pkey" PRIMARY KEY, btree (id)
    "index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)
    "index_pings_on_monitor_id" btree (monitor_id)

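I want to query all response times that are not NULL (about 90% are not NULL, about 10% are NULL), that have a specific monitor_id, and that were created in the last month. I'm running the query through ActiveRecord, but the end result looks something like this:

SELECT "pings"."response_time" FROM "pings" WHERE "pings"."monitor_id" = 3 AND (created_at > '2014-03-03 20:23:07.254281' AND response_time IS NOT NULL)

It's a pretty basic query, but it takes about 2000 ms to run, which seems rather slow. I assumed an index would make it faster, but none of the indexes I've tried helped, which I take to mean I'm not indexing properly.

When I run EXPLAIN ANALYZE, this is what I get:

Bitmap Heap Scan on pings  (cost=6643.25..183652.31 rows=83343 width=4) (actual time=58.997..1736.179 rows=42063 loops=1)
  Recheck Cond: (monitor_id = 3)
  Rows Removed by Index Recheck: 11643313
  Filter: ((response_time IS NOT NULL) AND (created_at > '2014-03-03 20:23:07.254281'::timestamp without time zone))
  Rows Removed by Filter: 324834
  ->  Bitmap Index Scan on index_pings_on_monitor_id  (cost=0.00..6622.41 rows=358471 width=0) (actual time=57.935..57.935 rows=366897 loops=1)
        Index Cond: (monitor_id = 3)

So the index on monitor_id is being used, but nothing else. I've tried various permutations and orderings of composite indexes on monitor_id, created_at, and response_time. I tried ordering the index by created_at descending. I tried a partial index on response_time IS NOT NULL.

Nothing I've tried makes the query any faster. How would you optimize and/or index it?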

Answer 1

Sequence of columns

Create a partial multicolumn index with the right sequence of columns. You have one:

"index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)

But the sequence of columns is not serving you well. Reverse it:

CREATE INDEX idx_pings_monitor_created ON pings (monitor_id, created_at DESC)
WHERE response_time IS NOT NULL;

The rule of thumb here is: equality first, ranges later. More about that:
Multicolumn index and performance
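
One way to check whether the planner picks up the reordered index is to rerun the query from the question under EXPLAIN ANALYZE. This is only a sketch: it assumes idx_pings_monitor_created has been created as shown above, and refreshing statistics first is an added suggestion rather than part of the original advice.

```sql
-- Refresh planner statistics after creating the index (an assumption, not a requirement).
ANALYZE pings;

-- Same predicates as the query in the question: equality on monitor_id,
-- a range on created_at, and the IS NOT NULL test that matches the partial index.
EXPLAIN ANALYZE
SELECT response_time
FROM   pings
WHERE  monitor_id = 3
AND    created_at > '2014-03-03 20:23:07.254281'
AND    response_time IS NOT NULL;
```

If the index is used as intended, both monitor_id and created_at should show up as index conditions, instead of created_at being applied as a Filter after the rows are fetched.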

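As has been discussed, the condition WHERE response_time IS NOT NULL does not buy you much. If you have other queries that could use this index and that include NULL values in response_time, drop the condition. Otherwise, keep it.

You may also be able to drop the other existing indexes. More about the sequence of columns in btree indexes: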
Working of indexes in PostgreSQL

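Covering index

If you only need response_time from the table, this can be much faster still, provided you don't have lots of write operations on the rows of your table. Include the column in the index at the last position to allow index-only scans (making it a "covering index"):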

CREATE INDEX idx_pings_monitor_created
ON     pings (monitor_id, created_at DESC, response_time)
WHERE  response_time IS NOT NULL;  -- maybe
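
Index-only scans depend on the visibility map, which VACUUM maintains, so on a freshly indexed or heavily written table the plan may still visit the heap. A minimal sketch for checking this, assuming the covering index above exists; running VACUUM manually here is an added suggestion, not part of the original advice:

```sql
-- Update the visibility map and planner statistics for the table.
VACUUM ANALYZE pings;

-- BUFFERS adds buffer usage to the plan output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT response_time
FROM   pings
WHERE  monitor_id = 3
AND    created_at > '2014-03-03 20:23:07.254281'
AND    response_time IS NOT NULL;
```

In the output, look for an "Index Only Scan" node and its "Heap Fetches" count. A high count suggests that many pages are not yet marked all-visible, which fits the caveat about write-heavy tables.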

Or, you could even try this ...

More radical partial index

Create a tiny helper function. It is effectively a "global constant" in your database:

CREATE OR REPLACE FUNCTION f_ping_event_horizon()
  RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-03-03 0:0'::timestamp$$;  -- One month in the past

Use it as a condition in your index:

CREATE INDEX idx_pings_monitor_created_response_time
ON     pings (monitor_id, created_at DESC, response_time)
WHERE  response_time IS NOT NULL  -- maybe
AND   created_at > f_ping_event_horizon();

And your query now looks like this:

SELECT response_time
FROM   pings
WHERE  monitor_id = 3
AND    response_time IS NOT NULL
AND    created_at > '2014-03-03 20:23:07.254281'
AND    created_at > f_ping_event_horizon();

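Aside: I trimmed some noise from the query.

The last condition seems logically redundant. Only include it if Postgres does not understand that it can use the index without it. It might be necessary. The actual timestamp in the condition must be later than the one in the function, but according to your comments that is obviously the case.

This way we cut out all the irrelevant rows and make the index much smaller. The effect degrades slowly over time. Refit the event horizon and recreate the indexes from time to time to get rid of the added weight. You could use a weekly cron job, for example.

When updating (recreating) the function, you need to recreate all indexes that use the function in any way, preferably in the same transaction, because the IMMUTABLE declaration for the helper function is a bit of a false promise. But Postgres only accepts immutable functions in index definitions, so we have to lie about it. More about that: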
Does PostgreSQL support "accent insensitive" collations?
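
The periodic refit described above could look roughly like the following. This is a sketch, not a tested maintenance script: the new horizon date is a placeholder, and it assumes idx_pings_monitor_created_response_time is the only index that uses f_ping_event_horizon().

```sql
BEGIN;

-- Move the "global constant" forward; the date below is only a placeholder.
CREATE OR REPLACE FUNCTION f_ping_event_horizon()
  RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-04-03 0:0'::timestamp$$;

-- Rebuild the index so its predicate is evaluated against the new horizon.
DROP INDEX idx_pings_monitor_created_response_time;
CREATE INDEX idx_pings_monitor_created_response_time
ON     pings (monitor_id, created_at DESC, response_time)
WHERE  response_time IS NOT NULL
AND    created_at > f_ping_event_horizon();

COMMIT;
```

Dropping and recreating the index inside one transaction keeps the function and the index consistent, but it holds locks on pings until COMMIT, so a quiet period is probably the better time to run it. CREATE INDEX CONCURRENTLY is not an option here because it cannot run inside a transaction block.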

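Why the function at all? This way, all the queries using the index can remain unchanged.

With all of these changes, the query should now be faster by orders of magnitude. A single, continuous index scan is all that's needed. Can you confirm that?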
