Question

I have a table with a few million rows called item, with columns that look like this:

CREATE TABLE item (
  id bigint NOT NULL,
  company_id bigint NOT NULL,
  date_created timestamp with time zone,
  ....
)

There is an index on company_id:

CREATE INDEX idx_company_id ON item USING btree (company_id);

This table is most often searched for the last 10 items of a given customer, i.e.:

SELECT * FROM item WHERE company_id = 5 ORDER BY date_created LIMIT 10;

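Currently, one customer accounts for about 75% of the data in this table, and the other 25% is spread across about 25 other customers. That means 75% of the rows have a company ID of 5, and the other rows have company IDs between 6 and 25.

The query generally runs very fast for all companies except the predominant one (id = 5). I can understand why: the index on company_id can be used for companies other than 5.

I have experimented with different indexes to make this search more efficient for company 5. The one that seemed to make the most sense is: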

CREATE INDEX idx_date_created
ON item (date_created DESC NULLS LAST);

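If I add this index, queries for the predominant company (id = 5) improve greatly, but queries for all other companies become very slow.

Here are some EXPLAIN ANALYZE results for company IDs 5 and 6, with and without the new index. (The plans refer to the table as photo and the column as photo_created; these appear to correspond to item and date_created in the simplified definitions above.)

Company ID 5

Before the new index: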

QUERY PLAN
Limit  (cost=214874.63..214874.65 rows=10 width=639) (actual time=10481.989..10482.017 rows=10 loops=1)
  ->  Sort  (cost=214874.63..218560.33 rows=1474282 width=639) (actual time=10481.985..10481.994 rows=10 loops=1)
        Sort Key: photo_created
        Sort Method: top-N heapsort  Memory: 35kB
        ->  Seq Scan on photo  (cost=0.00..183015.92 rows=1474282 width=639) (actual time=0.009..5345.551 rows=1473561 loops=1)
              Filter: (company_id = 5)
              Rows Removed by Filter: 402513
Total runtime: 10482.075 ms

After the new index:

QUERY PLAN
Limit  (cost=0.43..1.98 rows=10 width=639) (actual time=0.087..0.120 rows=10 loops=1)
  ->  Index Scan using idx_photo__photo_created on photo  (cost=0.43..228408.04 rows=1474282 width=639) (actual time=0.084..0.099 rows=10 loops=1)
        Filter: (company_id = 5)
        Rows Removed by Filter: 26
Total runtime: 0.164 ms

Company ID 6

Before the new index:

QUERY PLAN
Limit  (cost=2204.27..2204.30 rows=10 width=639) (actual time=0.044..0.053 rows=3 loops=1)
  ->  Sort  (cost=2204.27..2207.55 rows=1310 width=639) (actual time=0.040..0.044 rows=3 loops=1)
        Sort Key: photo_created
        Sort Method: quicksort  Memory: 28kB
        ->  Index Scan using idx_photo__company_id on photo  (cost=0.43..2175.96 rows=1310 width=639) (actual time=0.020..0.026 rows=3 loops=1)
              Index Cond: (company_id = 6)
Total runtime: 0.100 ms

After the new index:

QUERY PLAN
Limit  (cost=0.43..1744.00 rows=10 width=639) (actual time=0.039..3938.986 rows=3 loops=1)
  ->  Index Scan using idx_photo__photo_created on photo  (cost=0.43..228408.04 rows=1310 width=639) (actual time=0.035..3938.975 rows=3 loops=1)
        Filter: (company_id = 6)
        Rows Removed by Filter: 1876071
Total runtime: 3939.028 ms

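I have run a full VACUUM and an ANALYZE on the table, so PostgreSQL should have up-to-date statistics. Any ideas on how to get PostgreSQL to choose the right index for the company being queried?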

Answer 1

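This is known as the "abort-early plan problem", and it has been a chronic mis-optimization for years. Abort-early plans are amazing when they work, but terrible when they don't; the linked mailing list thread has a more detailed explanation. Basically, the planner thinks it will find the 10 rows you want for customer 6 without scanning the whole date_created index, and it is wrong.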
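The planner's optimism can be reproduced from the figures in the question's plans. As a rough sketch, assuming the Limit cost is obtained by scaling the full index scan's cost by the fraction of matching rows the planner expects to need (the LIMIT divided by the estimated number of matching rows), the arithmetic looks like this:

```sql
-- Sketch of the LIMIT cost interpolation, using numbers copied from the
-- question's "after the new index" plans. The linear scaling is an
-- assumption about how the estimate is derived, not planner source code.
SELECT company_id,
       round(0.43 + (228408.04 - 0.43) * 10 / est_rows, 2) AS est_limit_cost
FROM (VALUES (5, 1474282.0),   -- estimated matching rows for company 5
             (6, 1310.0))      -- estimated matching rows for company 6
     AS t(company_id, est_rows);
```

The two results line up with the Limit costs of 1.98 and 1744.00 shown in the plans. The estimate for company 6 assumes 1310 matching rows spread evenly through the index; the table actually held only 3, so the scan filtered out 1876071 rows before reaching the end of the index.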

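Prior to PostgreSQL 10 (not yet released; it is in beta), there is no hard-and-fast way to categorically improve this query. What you will want to do is nudge the query planner in various ways in the hope of getting the plan you want. The primary methods include anything that makes PostgreSQL more likely to use multi-column indexes, such as:

lowering random_page_cost (a good idea anyway if you are on SSDs);
lowering cpu_index_tuple_cost.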
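One way to try these two settings without touching the server configuration is to change them for a single session and compare plans. This is only a sketch: the values below are illustrative assumptions, not figures from the answer, and the query uses the simplified table name from the question.

```sql
-- Session-level experiment; the chosen values are assumptions for illustration.
SET random_page_cost = 1.1;        -- PostgreSQL default is 4.0
SET cpu_index_tuple_cost = 0.001;  -- PostgreSQL default is 0.005

-- Re-check the plan for one of the small companies.
EXPLAIN ANALYZE
SELECT * FROM item WHERE company_id = 6 ORDER BY date_created LIMIT 10;

-- Return to the configured values for the rest of the session.
RESET random_page_cost;
RESET cpu_index_tuple_cost;
```

If a setting turns out to help, it can then be made permanent in postgresql.conf or per database; whether it helps at all depends on the data and hardware.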

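You may also be able to fix the planner's behavior by playing with the table statistics. This includes:

raising statistics_target for the table and running ANALYZE again, so that PostgreSQL takes more samples and gets a better picture of the row distribution;
increasing n_distinct in the statistics to accurately reflect the number of distinct company_id values or distinct date_created values.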
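Both statistics adjustments are made per column with ALTER TABLE. The following is a minimal sketch: the target of 1000 and the n_distinct value of 26 (one large customer plus roughly 25 others, per the question) are assumptions to be replaced with values that fit the real data.

```sql
-- Collect a larger sample for company_id (the default target is 100).
ALTER TABLE item ALTER COLUMN company_id SET STATISTICS 1000;

-- Override the estimated number of distinct company_id values.
-- 26 is an assumed figure derived from the question's description.
ALTER TABLE item ALTER COLUMN company_id SET (n_distinct = 26);

-- Both changes only take effect once the statistics are rebuilt.
ANALYZE item;
```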

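However, all of these solutions are approximate. If query performance goes bad as your data changes in the future, this should be the first query you look at.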

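In PostgreSQL 10, you will be able to create Cross-Column Stats, which should improve the situation more reliably. Depending on how badly this is broken for you, you could try using the beta.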
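For reference, cross-column statistics in PostgreSQL 10 are created with CREATE STATISTICS. The sketch below uses a hypothetical statistics name and the simplified column names from the question; whether it actually changes the plan for this particular ORDER BY ... LIMIT query is not something the answer confirms, so treat it as an experiment.

```sql
-- Requires PostgreSQL 10 or later. The statistics object name is hypothetical.
CREATE STATISTICS item_company_date_stats (ndistinct, dependencies)
    ON company_id, date_created
    FROM item;

-- Extended statistics are populated by the next ANALYZE.
ANALYZE item;
```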

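If none of that works, I suggest asking on the #postgresql IRC channel on Freenode or on the pgsql-performance mailing list. The people there will ask for your detailed table statistics in order to make suggestions.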
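The per-column statistics that such a request usually refers to can be read from the pg_stats view. A minimal sketch, assuming the simplified table and column names from the question (substitute the real names if they differ):

```sql
-- Show what the planner currently believes about the two relevant columns.
SELECT attname,
       n_distinct,
       most_common_vals,
       most_common_freqs,
       correlation
FROM pg_stats
WHERE tablename = 'item'
  AND attname IN ('company_id', 'date_created');
```

Comparing most_common_freqs for company_id against the real 75%/25% split is a quick check on whether the estimates are far off.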

Answer 2

One more point: why do you create the index

CREATE INDEX idx_date_created ON item (date_created DESC NULLS LAST);

but then run:

SELECT * FROM item WHERE company_id = 5 ORDER BY date_created LIMIT 10;

Perhaps you meant:

SELECT * FROM item WHERE company_id = 5 ORDER BY date_created DESC NULLS LAST LIMIT 10;

Better still, create a combined index:

CREATE INDEX idx_company_id_date_created ON item (company_id, date_created DESC NULLS LAST);

After that, the plans for company 5 and company 6 are:

                                                                     QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..28.11 rows=10 width=16) (actual time=0.120..0.153 rows=10 loops=1)
   ->  Index Only Scan using idx_company_id_date_created on item  (cost=0.43..20763.68 rows=7500 width=16) (actual time=0.118..0.145 rows=10 loops=1)
         Index Cond: (company_id = 5)
         Heap Fetches: 10
 Planning time: 1.003 ms
 Execution time: 0.209 ms
(6 rows)
                                                                      QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..28.11 rows=10 width=16) (actual time=0.085..0.115 rows=10 loops=1)
   ->  Index Only Scan using idx_company_id_date_created on item  (cost=0.43..20763.68 rows=7500 width=16) (actual time=0.084..0.108 rows=10 loops=1)
         Index Cond: (company_id = 6)
         Heap Fetches: 10
 Planning time: 0.136 ms
 Execution time: 0.155 ms
(6 rows)

On your server it may be a little slower, but in any case it will be much better than in the examples above.
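A possible way to apply and verify this on a live table is sketched below. It assumes the simplified names from the question; using CONCURRENTLY and dropping the single-column index are additions for illustration, not steps given in the answer.

```sql
-- Build the combined index without blocking writes to the table.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_company_id_date_created
    ON item (company_id, date_created DESC NULLS LAST);

-- The ORDER BY must match the index ordering for the index to supply the sort.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM item
WHERE company_id = 6
ORDER BY date_created DESC NULLS LAST
LIMIT 10;

-- Optional: remove the single-column index once no other query depends on it.
DROP INDEX IF EXISTS idx_date_created;
```

The plans shown above report an Index Only Scan with width=16, which suggests a narrow test table; on the question's wider table (width=639) with SELECT *, a plain Index Scan on the combined index is the more likely outcome.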
