Question

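I have a relatively large table (about one million records) with the following columns:

account: character varying(36) not null
group: character varying(255) not null
classification: character varying(255) not null
size: integer not null

In practice the account is a UUID, but I don't think that matters here. If I run the following simple query, it takes about 16 seconds on my machine: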

select account, "group", classification, max(size) 
from mytable 
group by account, "group", classification

So far, so good. Now suppose I add an index:

CREATE INDEX CONCURRENTLY ON mytable (account, "group", classification);

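If I run the same query again, it now returns in under half a second. Running EXPLAIN on the query also clearly shows that the index is used.

However, if I change the query to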

select account, "group", classification, max(size) 
from mytable 
group by account, classification, "group"

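it takes 16 seconds again and the index is no longer used. To my mind the order of the grouping columns should not matter, but I am no expert. Any idea why PostgreSQL cannot (or does not) optimize the latter query? I tried this on PostgreSQL 9.4.

Edit: As requested, here is the EXPLAIN output. For the query that uses the index: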

Group  (cost=0.55..133878.11 rows=95152 width=76) (actual time=0.090..660.739 rows=807 loops=1)
  Group Key: group_id, classification_id, account_id
  ->  Index Only Scan using mytable_group_id_classification_id_account_id_idx on mytable  (cost=0.55..126741.72 rows=951518 width=76) (actual time=0.088..534.645 rows=951518 loops=1)
        Heap Fetches: 951518
Planning time: 0.106 ms
Execution time: 660.852 ms

For the query with the GROUP BY columns reordered:

Group  (cost=162327.31..171842.49 rows=95152 width=76) (actual time=11114.130..13938.487 rows=807 loops=1)
  Group Key: group_id, account_id, classification_id
  ->  Sort  (cost=162327.31..164706.10 rows=951518 width=76) (actual time=11114.127..13775.235 rows=951518 loops=1)
        Sort Key: group_id, account_id, classification_id
        Sort Method: external merge  Disk: 81136kB
        ->  Seq Scan on mytable  (cost=0.00..25562.18 rows=951518 width=76) (actual time=0.009..192.259 rows=951518 loops=1)
Planning time: 0.111 ms
Execution time: 13948.380 ms

Answer 1

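The result is the same regardless of the order in which the columns appear in the GROUP BY clause, and the same execution plan could be used.

The PostgreSQL optimizer does not consider reordering the GROUP BY expressions to see whether a different order would match an existing index.

This is a limitation, and you could ask on the pgsql-hackers list whether an enhancement would be welcome. You could back the request up with a patch that implements the desired functionality.

However, I am not sure such an improvement would be accepted. The downside is that the optimizer would have to do more work, which would affect the planning time of every query that uses a GROUP BY clause. Besides, the limitation is easy to work around: just rewrite the query and change the order of the GROUP BY expressions. So I'd say things should stay as they are.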
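
To make the workaround concrete, here is a minimal sketch using the table from the question. I'm assuming the index is the one created there, on (account, "group", classification). Only the GROUP BY list changes; the set of groups returned is the same.

```sql
-- Sketch only: mytable and its columns are the names used in the question.
-- "group" is double-quoted because it is a reserved word in PostgreSQL.
-- Assumed index column order: (account, "group", classification)
SELECT account, "group", classification, max(size)
FROM mytable
GROUP BY account, "group", classification;  -- same order as the index columns
```

Running the query under EXPLAIN should show whether the index scan is chosen on your version.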

Answer 2

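Actually, the column order in the GROUP BY clause does affect the result. By default, the result comes back sorted by the columns in GROUP BY. If you specify your own ORDER BY, the result and the index usage will be the same.

Demonstration: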

CREATE TABLE coconuts (
  mass int,
  volume int,
  loveliness int
);

INSERT INTO coconuts (mass, volume, loveliness)
  SELECT (random() * 5)::int
       , (random() * 5)::int
       , (random() * 1000 + 9000)::int
  FROM GENERATE_SERIES(1,10000000);

Note how the order of the columns in GROUP BY affects the ordering of the rows:

SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY mass, volume;

 mass | volume |  max  
------+--------+-------
    0 |      0 | 10000
    0 |      1 | 10000
    0 |      2 | 10000
...

SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY volume, mass;

 mass | volume |  max  
------+--------+-------
    0 |      0 | 10000
    1 |      0 | 10000
    2 |      0 | 10000
...

And how it affects the query plan:

CREATE INDEX ON coconuts (mass, volume);
SET enable_seqscan=false; --To force the index if possible

EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY (mass, volume);
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.46..460459.11 rows=40000 width=12)
   Group Key: mass, volume
   ->  Gather Merge  (cost=1000.46..459459.11 rows=80000 width=12)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.43..449225.10 rows=40000 width=12)
               Group Key: mass, volume
               ->  Parallel Index Scan using coconuts_mass_volume_idx on coconuts  (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)


EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY (volume, mass);
                                            QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=10001658532.83..10001758932.83 rows=40000 width=12)
   Group Key: volume, mass
   ->  Sort  (cost=10001658532.83..10001683532.83 rows=10000000 width=12)
         Sort Key: volume, mass
         ->  Seq Scan on coconuts  (cost=10000000000.00..10000154055.00 rows=10000000 width=12)
(5 rows)

However, if you add an ORDER BY that matches the original GROUP BY, the original query plan comes back, at least in Postgres 11.5.

EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY volume, mass
  ORDER BY mass, volume;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.46..460459.11 rows=40000 width=12)
   Group Key: mass, volume
   ->  Gather Merge  (cost=1000.46..459459.11 rows=80000 width=12)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.43..449225.10 rows=40000 width=12)
               Group Key: mass, volume
               ->  Parallel Index Scan using coconuts_mass_volume_idx on coconuts  (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)
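
Applied to the table from the question, the same idea might look like the sketch below. This assumes the index from the question on (account, "group", classification). I have not run it against that table, and the plan above was only observed on 11.5, so check the result with EXPLAIN on your own version.

```sql
-- Sketch only: mytable and its columns are the names used in the question.
-- "group" is double-quoted because it is a reserved word in PostgreSQL.
SELECT account, "group", classification, max(size)
FROM mytable
GROUP BY account, classification, "group"   -- the order that missed the index
ORDER BY account, "group", classification;  -- matches the assumed index column order
```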

Why does the column order in an index matter for GROUP BY in PostgreSQL?
