Question

We have a database with 450 million rows and the following structure:

   uid      id_1     id_2   d1  d2  d3  d4  d5  d6  d7  d8  d9  d10 d11 d12 d13 d14 d15 d16 d17
81038392    5655067 5468882 373 117 185 152 199 173 168 138 185 159 154 34  38  50  34  41  57
81038393    5655067 5468883 374 116 184 118 170 143 144 113 164 137 138 37  39  53  37  42  60
81038394    5655067 5468884 371 118 187 118 170 143 144 105 157 131 136 32  35  47  32  39  53
81038395    5655067 5468885 370 116 184 118 170 143 144 105 157 131 136 31  35  46  31  38  53
81038396    5655067 5468886 370 117 185 118 170 143 144 105 157 131 136 29  34  44  29  37  50
81038397    5655067 5470853 368 117 185 110 163 137 140 105 157 131 136 34  36  48  34  39  55
81038398    5655067 5470854 372 119 188 118 170 143 144 113 164 137 138 34  36  49  34  40  55
81038399    5655067 5470855 360 115 182 103 151 131 136 98  145 125 131 30  34  45  30  38  51
81038400    5655067 5470856 357 112 177 103 151 131 136 98  145 125 131 30  34  45  30  37  51
81038401    5655067 5470857 356 111 176 103 151 131 136 98  145 125 131 28  33  43  28  36  50
81038402    5655067 5470858 358 113 179 103 151 131 136 98  145 125 131 31  35  46  31  38  52
81038403    5655067 5472811 344 109 173 152 199 173 168 138 185 159 154 31  36  46  31  39  52
81038404    5655068 5468882 373 117 185 152 199 173 168 138 185 159 154 34  38  50  34  41  57
81038405    5655068 5468883 374 116 184 118 170 143 144 113 164 137 138 37  39  53  37  42  60
81038406    5655068 5468884 371 118 187 118 170 143 144 105 157 131 136 32  35  47  32  39  53
81038407    5655068 5468885 370 116 184 118 170 143 144 105 157 131 136 31  35  46  31  38  53
81038408    5655068 5468886 370 117 185 118 170 143 144 105 157 131 136 29  34  44  29  37  50
81038409    5655068 5470853 368 117 185 110 163 137 140 105 157 131 136 34  36  48  34  39  55
81038410    5655068 5470854 372 119 188 118 170 143 144 113 164 137 138 34  36  49  34  40  55
81038411    5655068 5470855 360 115 182 103 151 131 136 98  145 125 131 30  34  45  30  38  51
81038412    5655068 5470856 357 112 177 103 151 131 136 98  145 125 131 30  34  45  30  37  51
81038413    5655068 5470857 356 111 176 103 151 131 136 98  145 125 131 28  33  43  28  36  50
81038414    5655068 5470858 358 113 179 103 151 131 136 98  145 125 131 31  35  46  31  38  52

We need to run queries like the following all the time:

Query 1:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM mytable WHERE id_1 = 5655067;

                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using id_1_idx on mytable (cost=0.57..99187.68 rows=25742 width=80) (actual time=47.081..2600.899 rows=21487 loops=1)
   Index Cond: (id_1 = 5655067)
   Buffers: shared hit=9 read=4816
   I/O Timings: read=2563.181
 Planning time: 0.151 ms
 Execution time: 2602.320 ms
(6 rows)

Query 2:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM mytable WHERE id_2 = 5670433;

                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on mytable (cost=442.02..89887.42 rows=23412 width=80) (actual time=113.200..42127.512 rows=21487 loops=1)
   Recheck Cond: (id_2 = 5670433)
   Heap Blocks: exact=16988
   Buffers: shared hit=30 read=17020
   I/O Timings: read=41971.798
   ->  Bitmap Index Scan on id_2_idx  (cost=0.00..436.16 rows=23412 width=0) (actual time=104.928..104.929 rows=21487 loops=1)
         Index Cond: (id_2 = 5670433)
         Buffers: shared hit=2 read=60
         I/O Timings: read=99.235
 Planning time: 0.163 ms
 Execution time: 42132.556 ms
(11 rows)

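There are about 23,000 to 25,000 distinct id_1 and id_2 values, and both queries always return about 24,000 rows. We only read the data, and the data does not change over time.

The problem:

Query 1 takes about 3 seconds, which is a bit long but still tolerable.
Query 2 takes up to 30-40 seconds, which is far too much for us, because the service is an interactive web service.

We have indexed id_1 and id_2. We also added a composite index on id_1 and id_2, as suggested by the Azure PostgreSQL-as-a-service platform where the data resides. It did not help.

My assumption is that Query 1 is fast because all of its rows are stored sequentially in the database, whereas for Query 2 the rows are always scattered non-sequentially across the whole database.

Restructuring the data to speed up Query 2 is not a good idea, because it would reduce the performance of Query 1. I know the way this data is structured is not ideal, but I have no control over it. Are there any suggestions for speeding up Query 2 to a reasonable level?

Edit 2:

Create index statements: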

CREATE INDEX id_1_idx ON mytable (id_1);
CREATE INDEX id_2_idx ON mytable (id_2);

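Vacuuming the table did not change the plan. After the vacuum, the output of EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM mytable WHERE id_1 = 5655067 is very similar. Here is the output of the verbose vacuum: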

VACUUM (VERBOSE, ANALYZE) mytable;

INFO:  vacuuming "public.mytable"
INFO:  index "mytable_pkey" now contains 461691169 row versions in 1265896 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 2695.21 s.
INFO:  index "id_1_idx" now contains 461691169 row versions in 1265912 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 1493.20 s.
INFO:  index "id_2_idx" now contains 461691169 row versions in 1265912 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 1296.06 s.
INFO:  index "mytable_id_1_id_2_idx" now contains 461691169 row versions in 1265912 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 2364.16 s.
INFO:  "mytable": found 0 removable, 389040319 nonremovable row versions in 5187205 out of 6155883 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 12767
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 13560.60 s.
INFO:  analyzing "public.mytable"
INFO:  "mytable": scanned 30000 of 6155883 pages, containing 2250000 live rows and 0 dead rows; 30000 rows in sample, 461691225 estimated total rows
VACUUM

Answer 1

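TL;DR

Storage I/O is your main bottleneck, and on top of that there is not enough RAM for the indexes, as you can work out yourself:

For the Bitmap Heap Scan you can calculate an average block read latency of about 2.5 ms (17020 blocks read in 41971.798 ms), which is far too slow.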
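The latency figure can be checked with plain arithmetic on the numbers in the question's plans. The query below is only an illustrative sketch; the comparison with Query 1 is an addition here, taken from the Buffers and I/O Timings lines of the first plan.

```sql
-- Average read latency per block = I/O read time / blocks read.
-- All four numbers are copied from the two EXPLAIN outputs in the question.
SELECT round(41971.798 / 17020, 2) AS query2_ms_per_block,  -- Bitmap Heap Scan on id_2
       round(2563.181 / 4816, 2)   AS query1_ms_per_block;  -- Index Scan on id_1
```

Query 2 comes out at about 2.5 ms per block, while Query 1 pays roughly 0.5 ms per block. Query 2 therefore reads more blocks and also waits longer for each one.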

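The only way to avoid disk reads is a lot of RAM. Faster storage will make the system more scalable, because this is most likely not the only type of query, nor the only table in the database.

Long version:

Reading the helpful output of EXPLAIN shows that the cost estimates made by the planner are far off, and that the performance loss comes from disk reads.

Since you wrote that the data does not change over time (so you know the value ranges in advance), you could also range-partition the table on these two columns, so that only specific partitions have to be scanned (using smaller indexes and reading less table heap). However, if the application accessing this data ends up touching more or less the whole data range anyway, this will not help much either.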
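A minimal sketch of what such range partitioning could look like, assuming PostgreSQL 11 or later (declarative partitioning with a DEFAULT partition and indexes on the parent table), which may be newer than the server in the question. The table names mytable_part and mytable_part_p1/p2 and the partition bounds are hypothetical, and for brevity the sketch partitions on id_2 only.

```sql
-- Hypothetical sketch: a range-partitioned copy of mytable, partitioned on id_2.
-- Names and bounds are illustrative and not taken from the question.
CREATE TABLE mytable_part (LIKE mytable) PARTITION BY RANGE (id_2);

CREATE TABLE mytable_part_p1 PARTITION OF mytable_part
    FOR VALUES FROM (5460000) TO (5470000);
CREATE TABLE mytable_part_p2 PARTITION OF mytable_part
    FOR VALUES FROM (5470000) TO (5480000);
-- ... further partitions covering the rest of the known id_2 range ...

-- Catch-all so that rows outside the listed bounds do not make the load fail.
CREATE TABLE mytable_part_default PARTITION OF mytable_part DEFAULT;

-- Load once (the data is read-only), then index and collect statistics.
INSERT INTO mytable_part SELECT * FROM mytable;
CREATE INDEX ON mytable_part (id_2);  -- creates a matching index on every partition
ANALYZE mytable_part;
```

A query with WHERE id_2 = ... then only touches the one partition whose bounds contain the value. Whether that reduces the number of random block reads enough depends on how the rows are laid out inside each partition.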

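So you should consider replacing the storage subsystem, so that it can handle your queries within the performance requirements of your application.

I suspect that the PostgreSQL server is still running on HDDs rather than SSDs. A small test with only 120 million rows shows the following characteristics for both indexes: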

create table nums (uid integer primary key, id_1 integer, id_2 integer, d1 integer, d2 integer, d3 integer, d4 integer, d5 integer, d6 integer, d7 integer, d8 integer, d9 integer, d10 integer, d11 integer, d12 integer, d13 integer, d14 integer, d15 integer, d16 integer, d17 integer);

INSERT INTO nums select generate_series(80000001, 200000000) AS uid, (random() * 23000)::integer + 5600000 AS id_1, (random() * 25000)::integer + 5600000 AS id_2, (random() * 1000)::integer AS d1, (random() * 1000)::integer AS d2, (random() * 1000)::integer AS d3, (random() * 1000)::integer AS d4, (random() * 1000)::integer AS d5, (random() * 1000)::integer AS d6, (random() * 1000)::integer AS d7, (random() * 1000)::integer AS d8, (random() * 1000)::integer AS d9, (random() * 1000)::integer AS d10, (random() * 1000)::integer AS d11, (random() * 100)::integer AS d12, (random() * 100)::integer AS d13, (random() * 100)::integer AS d14, (random() * 100)::integer AS d15, (random() * 100)::integer AS d16, (random() * 100)::integer AS d17;

create index id_1_idx on nums (id_1);
create index id_2_idx on nums (id_2);
cluster nums using id_1_idx;

...which leads to the following results (both cold reads):

explain (analyze, buffers) select * from nums where id_1 = 5606001;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Index Scan using id_1_idx on nums  (cost=0.57..5816.92 rows=5198 width=80) (actual time=1.680..6.394 rows=5185 loops=1)
   Index Cond: (id_1 = 5606001)
   Buffers: shared read=88
   I/O Timings: read=4.397
 Planning Time: 4.002 ms
 Execution Time: 7.475 ms
(6 rows)

Time: 15.924 ms

...and for id_2:

explain (analyze, buffers) select * from nums where id_2 = 5606001; 
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Index Scan using id_2_idx on nums  (cost=0.57..5346.53 rows=4777 width=80) (actual time=0.376..985.689 rows=4748 loops=1)
   Index Cond: (id_2 = 5606001)
   Buffers: shared hit=1 read=4755
   I/O Timings: read=972.555
 Planning Time: 0.203 ms
 Execution Time: 986.590 ms
(6 rows)

Time: 987.296 ms

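So this is still fast enough, even though my table is "only" 12 GiB + 3 x 2.5 GiB (PK + 2 indexes).

If the server is already running on SSDs, make sure to (physically) separate the data storage for WAL/logs, table data (tablespace), and indexes (tablespace), to benefit as much as possible from parallelism and to reduce I/O interference caused by other services or applications on the same system.

Also consider a server system with far more memory for the table and index data (for this ~48 GiB table plus ~10 GiB per index, assuming all integer columns), and then pre-warm it to push the data from disk into memory. At least the indexes should be able to stay in memory completely.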
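One way to do such a pre-warm is the pg_prewarm contrib extension. The sketch below assumes that the extension is available on the server (on a managed service this has to be checked first) and that shared_buffers is large enough to hold what is loaded.

```sql
-- How much memory would the heap and all indexes need?
SELECT pg_size_pretty(pg_relation_size('mytable')) AS heap_size,
       pg_size_pretty(pg_indexes_size('mytable'))  AS all_indexes_size;

-- Assumption: the pg_prewarm contrib extension can be installed here.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Load the index used by Query 2 into shared buffers; returns the number of blocks read.
SELECT pg_prewarm('id_2_idx');

-- Only worthwhile if the whole heap actually fits into memory.
SELECT pg_prewarm('mytable');
```

Pre-warming does not survive a restart on its own, so it would have to be repeated after each server start.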

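Edit: The reason my server does not use a bitmap (index + heap) scan is that I am running on SSDs and have lowered the random page cost (random_page_cost) from the default of 4 down to 1.1. For an HDD system, that would make no sense.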
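For reference, the setting can be inspected and changed per session as sketched below. The value 1.1 is the one mentioned above, and mytable is the table from the question.

```sql
-- Show the current planner cost for a non-sequential page fetch (default: 4).
SHOW random_page_cost;

-- Lower it for the current session only; sensible for SSD-backed storage.
SET random_page_cost = 1.1;

-- Re-check which plan the planner now prefers.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM mytable WHERE id_2 = 5670433;
```

This only changes which plan is chosen. It does not make the underlying block reads any faster.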

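EDIT #2: A retest of the situation reveals an interesting behavior:

In my test I assumed that the first column, uid, is the primary key column and a serial (sequential integer), by which the entries are initially ordered on disk. When generating the data, the values of the two interesting indexed columns id_1 and id_2 are generated randomly, which is usually the worst case for big tables.

However, that is not the case here. After creating the test data and the indexes and after analyzing the table, but before reordering the data using the index on column id_1, I now get these results: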

explain (analyze, buffers) select * from nums where id_1 = 5606001;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on nums  (cost=63.32..7761.68 rows=5194 width=80) (actual time=1.978..41.007 rows=5210 loops=1)
   Recheck Cond: (id_1 = 5606001)
   Heap Blocks: exact=5198
   Buffers: shared read=5217
   I/O Timings: read=28.732
   ->  Bitmap Index Scan on id_1_idx  (cost=0.00..62.02 rows=5194 width=0) (actual time=1.176..1.176 rows=5210 loops=1)
         Index Cond: (id_1 = 5606001)
         Buffers: shared read=19
         I/O Timings: read=0.124
 Planning Time: 7.214 ms
 Execution Time: 41.419 ms
(11 rows)

...and:

explain (analyze, buffers) select * from nums where id_2 = 5606001;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on nums  (cost=58.52..7133.04 rows=4768 width=80) (actual time=7.305..43.830 rows=4813 loops=1)
   Recheck Cond: (id_2 = 5606001)
   Heap Blocks: exact=4805
   Buffers: shared hit=12 read=4810
   I/O Timings: read=28.181
   ->  Bitmap Index Scan on id_2_idx  (cost=0.00..57.33 rows=4768 width=0) (actual time=5.102..5.102 rows=4813 loops=1)
         Index Cond: (id_2 = 5606001)
         Buffers: shared read=17
         I/O Timings: read=2.414
 Planning Time: 0.227 ms
 Execution Time: 44.197 ms
(11 rows)

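All plans and optimizations are available here:

I also followed my own best practice here and moved the indexes to a separate tablespace on another physical SSD.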
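The answer does not show the statements used for this. A sketch of how indexes can be moved to their own tablespace is given below; the tablespace name fast_idx and the directory are hypothetical, and creating a tablespace needs superuser rights, so it is probably not possible on a managed service.

```sql
-- Hypothetical name and path: the directory must already exist, be empty,
-- and be owned by the operating system user that runs PostgreSQL.
CREATE TABLESPACE fast_idx LOCATION '/mnt/ssd2/pg_idx';

-- Move the two secondary indexes of the test table; each index is locked while it is moved.
ALTER INDEX id_1_idx SET TABLESPACE fast_idx;
ALTER INDEX id_2_idx SET TABLESPACE fast_idx;
```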

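As we can see, to fetch the ~5000 result rows, more or less the same number of blocks has to be read in both cases, both times using a Bitmap Heap Scan.

The correlation of the two columns in this case: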

 attname | correlation | n_distinct 
---------+-------------+------------
 id_1    |  -0.0047043 |      23003
 id_2    |  0.00157998 |      25004

Now, re-running the queries after CLUSTER ... USING id_1_idx and after re-analyzing the table, which results in the following correlations:

 attname | correlation  | n_distinct 
---------+--------------+------------
 id_1    |            1 |      22801
 id_2    | -0.000898521 |      24997

...shows the following performance:

explain (analyze, buffers) select * from nums where id_1 = 5606001;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Index Scan using id_1_idx on nums  (cost=0.57..179.02 rows=5083 width=80) (actual time=2.604..5.256 rows=5210 loops=1)
   Index Cond: (id_1 = 5606001)
   Buffers: shared read=90
   I/O Timings: read=4.107
 Planning Time: 4.039 ms
 Execution Time: 5.563 ms
(6 rows)

...much better, as expected, but:

explain (analyze, buffers) select * from nums where id_2 = 5606001;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on nums  (cost=58.57..7140.12 rows=4775 width=80) (actual time=5.866..99.707 rows=4813 loops=1)
   Recheck Cond: (id_2 = 5606001)
   Heap Blocks: exact=4806
   Buffers: shared read=4823
   I/O Timings: read=31.389
   ->  Bitmap Index Scan on id_2_idx  (cost=0.00..57.38 rows=4775 width=0) (actual time=2.992..2.992 rows=4813 loops=1)
         Index Cond: (id_2 = 5606001)
         Buffers: shared read=17
         I/O Timings: read=0.338
 Planning Time: 0.210 ms
 Execution Time: 100.155 ms
(11 rows)

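...is more than twice as slow, even though almost exactly the same number of blocks has to be read as in the first, random run.

Why this slowdown?

Physically reordering the table data using the index id_1_idx also affected the physical order of the other column's values. Now, the purpose of a Bitmap Heap Scan is to take the list of blocks to read from the Bitmap Index Scan and read them in physical (on-disk) order. In the first case (random), there was a good chance that several rows matching the condition were located in consecutive blocks on disk, which results in fewer random disk accesses.

Interestingly (but this may just be because I am running on SSDs), disabling bitmap scans shows acceptable numbers: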

explain (analyze, buffers) select * from nums where id_2 = 5606001;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Index Scan using id_2_idx on nums  (cost=0.57..7257.12 rows=4775 width=80) (actual time=0.151..35.453 rows=4813 loops=1)
   Index Cond: (id_2 = 5606001)
   Buffers: shared read=4823
   I/O Timings: read=30.051
 Planning Time: 1.927 ms
 Execution Time: 35.810 ms
(6 rows)
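The answer does not say how bitmap scans were disabled. One common way, assumed here, is the enable_bitmapscan planner setting, switched off for the current session only:

```sql
-- Discourage the planner from choosing bitmap scans in this session.
SET enable_bitmapscan = off;

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM nums WHERE id_2 = 5606001;

-- Restore the default afterwards so that other queries are not affected.
RESET enable_bitmapscan;
```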

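All of these numbers come from almost completely cold-start executions (as you can see, there are no Buffers: shared hit numbers, or they are very low).

It is also interesting that the I/O timings for id_2 are very similar between the bitmap scan and the index scan, yet the bitmap scan seems to introduce a huge overhead here.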

Answer 2

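The difference is that id_1 is highly correlated, i.e. the order of that column corresponds to the physical order of the rows, while id_2 is not correlated.

Test with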

SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'mytable'
  AND attname IN ('id_1', 'id_2');

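If the correlation is high, the rows for a single value of the column will be in a few adjacent blocks of the table. If the correlation is low, the rows will be all over the table, and many more blocks have to be read.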
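The effect shows up in the question's own numbers. The sketch below divides the rows returned by the blocks read in each plan, and derives the rows per heap page from the ANALYZE sample in the VACUUM output. Note that the block counts include a small number of index pages.

```sql
-- Rows returned per block read, using figures copied from the question.
SELECT round(21487.0 / 4816, 1)  AS query1_rows_per_block_read,  -- id_1, Index Scan
       round(21487.0 / 17020, 1) AS query2_rows_per_block_read,  -- id_2, Bitmap Heap Scan
       round(2250000.0 / 30000)  AS rows_per_heap_page;          -- ANALYZE: rows / pages sampled
```

Query 1 gets roughly 4.5 matching rows out of every block it reads and Query 2 only about 1.3, while a heap page holds about 75 rows. For id_2, almost every matching row therefore costs its own block read.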

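To get high correlation, you can rewrite the table with the CLUSTER statement to reorder the rows. If there are no deletes and updates, a table will be physically ordered in insertion order.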
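As a sketch, clustering the question's table on the id_2 index would look like this. CLUSTER takes an ACCESS EXCLUSIVE lock and rewrites the whole table, so it blocks all other access while it runs and temporarily needs roughly the table's size in free disk space.

```sql
-- Rewrite mytable in the order of id_2_idx (exclusive lock for the duration).
CLUSTER mytable USING id_2_idx;

-- Refresh the statistics so that the planner sees the new physical order.
ANALYZE mytable;

-- id_2 should now show a correlation close to 1, and id_1 a low one.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'mytable'
  AND attname IN ('id_1', 'id_2');
```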

You can speed up either one query or the other, but not both.

PostgreSQL: one index is fast and the other is slow
