Why does the order of columns in an index matter for GROUP BY in PostgreSQL?

Time: 2016-12-23 16:17:29

Tags: postgresql indexing group-by

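I have a relatively large table (about one million records) with the following columns:

  • account: character varying(36) not null
  • group: character varying(255) not null
  • classification: character varying(255) not null
  • size: integer not null

The account is in practice a UUID, but I don't think that matters here. If I run the following simple query, it takes about 16 seconds on my machine: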

select account, "group", classification, max(size)
from mytable
group by account, "group", classification

So far so good. Suppose I add an index:

CREATE INDEX CONCURRENTLY ON mytable (account, "group", classification);

If I run the same query again, it now returns in less than half a second. EXPLAIN also clearly shows that the index is used.

However, if I change the query to

select account, "group", classification, max(size)
from mytable
group by account, classification, "group"

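it takes 16 seconds again and the index is no longer used. It seems to me that the order of the grouping criteria should not matter, but I'm not an expert. Any idea why PostgreSQL can't (or doesn't) optimize the latter query? I tried this on PostgreSQL 9.4.

Edit: As requested, here is the EXPLAIN output. For the query that uses the index: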

Group  (cost=0.55..133878.11 rows=95152 width=76) (actual time=0.090..660.739 rows=807 loops=1)
  Group Key: group_id, classification_id, account_id
  ->  Index Only Scan using mytable_group_id_classification_id_account_id_idx on mytable  (cost=0.55..126741.72 rows=951518 width=76) (actual time=0.088..534.645 rows=951518 loops=1)
        Heap Fetches: 951518
Planning time: 0.106 ms
Execution time: 660.852 ms

For the query with the GROUP BY columns in a different order:

Group  (cost=162327.31..171842.49 rows=95152 width=76) (actual time=11114.130..13938.487 rows=807 loops=1)
  Group Key: group_id, account_id, classification_id
  ->  Sort  (cost=162327.31..164706.10 rows=951518 width=76) (actual time=11114.127..13775.235 rows=951518 loops=1)
        Sort Key: group_id, account_id, classification_id
        Sort Method: external merge  Disk: 81136kB
        ->  Seq Scan on mytable  (cost=0.00..25562.18 rows=951518 width=76) (actual time=0.009..192.259 rows=951518 loops=1)
Planning time: 0.111 ms
Execution time: 13948.380 ms

2 answers:

Answer 0 (score: 2)

The result is the same regardless of the order in which the columns appear in the GROUP BY clause, and the same execution plan could be used.
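One way to convince yourself that both orderings return the same set of rows is to subtract one result from the other. This is only a sketch against the table from the question; "group" is quoted because GROUP is a reserved word in PostgreSQL, and the query checks one direction only (swap the two halves to check the other).

```sql
-- Rows produced by the first ordering that the second ordering does not produce.
-- An empty result means the first result set is contained in the second.
(SELECT account, "group", classification, max(size)
 FROM mytable
 GROUP BY account, "group", classification)
EXCEPT
(SELECT account, "group", classification, max(size)
 FROM mytable
 GROUP BY account, classification, "group");
```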

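The PostgreSQL optimizer simply does not consider reordering the GROUP BY expressions to see whether a different order would match an existing index.

This is a limitation, and you could ask on the pgsql-hackers list whether an enhancement would be welcome. You could back that up with a patch that implements the desired feature.

However, I'm not sure such an enhancement would be accepted. Its downside is that the optimizer would have to do more work, which would affect the planning time of every query that uses a GROUP BY clause. Besides, the limitation is easy to work around: just rewrite the query and change the order of the GROUP BY expressions. So I would say things should stay as they are.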
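Applied to the query from the question, the workaround amounts to listing the GROUP BY columns in the same order as the index columns. A minimal sketch, assuming the index on (account, "group", classification) from the question exists; "group" is quoted because GROUP is a reserved word in PostgreSQL.

```sql
-- GROUP BY columns listed in index order, so the planner can read
-- pre-sorted rows from the index instead of sorting the whole table.
EXPLAIN
SELECT account, "group", classification, max(size)
FROM mytable
GROUP BY account, "group", classification;
```

Drop the EXPLAIN keyword to run the query itself once the plan shows the index being used.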

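Answer 1 (score: 2)

Actually, the order of the columns in the GROUP BY clause does affect the result. When the grouping is done by sorting, the rows come back sorted by the GROUP BY columns in the order they are listed (this ordering is a side effect of the plan and is not guaranteed without an ORDER BY). If you set your own ORDER BY, the result and the index usage will be the same.

Demo: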

CREATE TABLE coconuts (
  mass int,
  volume int,
  loveliness int
);

INSERT INTO coconuts (mass, volume, loveliness)
  SELECT (random() * 5)::int
       , (random() * 5)::int
       , (random() * 1000 + 9000)::int
  FROM GENERATE_SERIES(1,10000000);

Note how the order of the columns in GROUP BY affects the order of the rows:

SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY mass, volume;

 mass | volume |  max  
------+--------+-------
    0 |      0 | 10000
    0 |      1 | 10000
    0 |      2 | 10000
...

SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY volume, mass;

 mass | volume |  max  
------+--------+-------
    0 |      0 | 10000
    1 |      0 | 10000
    2 |      0 | 10000
...

And how it affects the query plan:

CREATE INDEX ON coconuts (mass, volume);
SET enable_seqscan=false; --To force the index if possible

EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY (mass, volume);
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.46..460459.11 rows=40000 width=12)
   Group Key: mass, volume
   ->  Gather Merge  (cost=1000.46..459459.11 rows=80000 width=12)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.43..449225.10 rows=40000 width=12)
               Group Key: mass, volume
               ->  Parallel Index Scan using coconuts_mass_volume_idx on coconuts  (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)


EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY (volume, mass);
                                            QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=10001658532.83..10001758932.83 rows=40000 width=12)
   Group Key: volume, mass
   ->  Sort  (cost=10001658532.83..10001683532.83 rows=10000000 width=12)
         Sort Key: volume, mass
         ->  Seq Scan on coconuts  (cost=10000000000.00..10000154055.00 rows=10000000 width=12)
(5 rows)

However, if you add an ORDER BY that matches the original GROUP BY, the original query plan comes back, at least in Postgres 11.5.

EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY volume, mass
  ORDER BY mass, volume;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.46..460459.11 rows=40000 width=12)
   Group Key: mass, volume
   ->  Gather Merge  (cost=1000.46..459459.11 rows=80000 width=12)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.43..449225.10 rows=40000 width=12)
               Group Key: mass, volume
               ->  Parallel Index Scan using coconuts_mass_volume_idx on coconuts  (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)
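The same idea can be tried on the table from the question. This is a sketch only: the behavior above was observed on Postgres 11.5, and whether 9.4 (the version used in the question) picks the index for this form is an assumption that should be checked with EXPLAIN. "group" is quoted because GROUP is a reserved word in PostgreSQL.

```sql
-- GROUP BY keeps the order that missed the index; the explicit ORDER BY
-- asks for rows in index order, which may let the planner use the index again.
EXPLAIN
SELECT account, "group", classification, max(size)
FROM mytable
GROUP BY account, classification, "group"
ORDER BY account, "group", classification;
```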