Question

我有一个简单的查询，例如select * from xxx where col is not null limit 10。我不知道为什么Postgres更喜欢seq扫描，它比部分索引要慢得多（我已经分析了表格）。如何调试这样的问题？

该表有超过400万行。大约350,000行满足了pid is not null。

我认为成本估算可能有问题。 seq扫描的成本低于索引扫描。但是如何深入研究呢？

我有一个猜测，但不确定。非空行约占总行的10％。这意味着当seq扫描100行时，它可能会得到10行，这些行不为空。并且它认为seq扫描100行的成本低于索引扫描10行，然后随机获取10整行的成本。是吗？

> \d data_import
+--------------------+--------------------------+----------------------------------------------------------------------------+
| Column             | Type                     | Modifiers                                                                  |
|--------------------+--------------------------+----------------------------------------------------------------------------|
| id                 | integer                  |  not null default nextval('data_import_id_seq'::regclass) |
| name               | character varying(64)    |                                                                            |
| market_activity_id | integer                  |  not null                                                                  |
| hmsr_id            | integer                  |  not null default (-1)                                                     |
| site_id            | integer                  |  not null default (-1)                                                     |
| hmpl_id            | integer                  |  not null default (-1)                                                     |
| hmmd_id            | integer                  |  not null default (-1)                                                     |
| hmci_id            | integer                  |  not null default (-1)                                                     |
| hmkw_id            | integer                  |  not null default (-1)                                                     |
| creator_id         | integer                  |                                                                            |
| created_at         | timestamp with time zone |                                                                            |
| updated_at         | timestamp with time zone |                                                                            |
| bias               | integer                  |                                                                            |
| pid                | character varying(128)   |  default NULL::character varying                                           |
+--------------------+--------------------------+----------------------------------------------------------------------------+
Indexes:
    "data_import_pkey" PRIMARY KEY, btree (id)
    "unique_hmxx" UNIQUE, btree (site_id, hmsr_id, hmpl_id, hmmd_id, hmci_id, hmkw_id) WHERE pid IS NULL
    "data_import_pid_idx" UNIQUE, btree (pid) WHERE pid IS NOT NULL
    "data_import_created_at_idx" btree (created_at)
    "data_import_hmsr_id" btree (hmsr_id)
    "data_import_updated_at_idx" btree (updated_at)



> set enable_seqscan to false;
apollon> explain (analyse, verbose)  select * from data_import where pid is not null limit 10
+-------------------------------------------------------------------------------------------------------------------------------------------------------------
| QUERY PLAN
|-------------------------------------------------------------------------------------------------------------------------------------------------------------
| Limit  (cost=0.42..5.68 rows=10 width=84) (actual time=0.059..0.142 rows=10 loops=1)
|   Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid
|   ->  Index Scan using data_import_pid_idx on public.data_import  (cost=0.42..184158.08 rows=350584 width=84) (actual time
|         Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid
|         Index Cond: (data_import.pid IS NOT NULL)
| Planning time: 0.126 ms
| Execution time: 0.177 ms
+-------------------------------------------------------------------------------------------------------------------------------------------------------------
EXPLAIN
Time: 0.054s

> set enable_seqscan to true;
> explain (analyse, verbose)  select * from data_import where pid is not null limit 10
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                                        |
|---------------------------------------------------------------------------------------------------------------------------------------------------|
| Limit  (cost=0.00..2.37 rows=10 width=84) (actual time=407.042..407.046 rows=10 loops=1)                                                          |
|   Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid       |
|   ->  Seq Scan on public.data_import  (cost=0.00..83016.60 rows=350584 width=84) (actual time=407.041..407.045 rows=10 loops=1)  |
|         Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid |
|         Filter: (data_import.pid IS NOT NULL)                                                                                    |
|         Rows Removed by Filter: 3672502                                                                                                           |
| Planning time: 0.116 ms                                                                                                                           |
| Execution time: 407.078 ms                                                                                                                        |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
EXPLAIN
Time: 0.426s

Answer 1

您的问题是

Rows Removed by Filter: 3672502

PostgreSQL知道值的分布以及它们与物理表布局的关系，但是它不知道表开头的行对于pid都为NULL。

如果NULL分布均匀，则顺序扫描将迅速找到10个匹配项并停止，但实际上，它必须读取3672512行才能找到10个匹配的匹配项。

如果在ORDER BY pid之前添加LIMIT（即使不需要），则优化器将做正确的事情。

为什么Postgres更喜欢seq扫描而不是带有显式where条件的部分索引？

1 个答案: