Question

Please observe:

(I forgot to add the ordering; the plan has been updated.)

The query:

EXPLAIN ANALYZE
SELECT DISTINCT(id), special, customer, business_no, bill_to_name, bill_to_address1, bill_to_address2, bill_to_postal_code, ship_to_name, ship_to_address1, ship_to_address2, ship_to_postal_code, 
purchase_order_no, ship_date::text, calc_discount_text(o) AS discount, discount_absolute, delivery, hst_percents, sub_total, total_before_hst, hst, total, total_discount, terms, rep, ship_via, 
item_count, version, to_char(modified, 'YYYY-MM-DD HH24:MI:SS') AS "modified", to_char(created, 'YYYY-MM-DD HH24:MI:SS') AS "created"
FROM invoices o
LEFT JOIN reps ON reps.rep_id = o.rep_id
LEFT JOIN terms ON terms.terms_id = o.terms_id
LEFT JOIN shipVia ON shipVia.ship_via_id = o.ship_via_id
JOIN invoiceItems items ON items.invoice_id = o.id 
WHERE items.qty < 5
ORDER BY modified
LIMIT 100

The result:

Limit  (cost=2931740.10..2931747.85 rows=100 width=635) (actual time=414307.004..414387.899 rows=100 loops=1)
  ->  Unique  (cost=2931740.10..3076319.37 rows=1865539 width=635) (actual time=414307.001..414387.690 rows=100 loops=1)
        ->  Sort  (cost=2931740.10..2936403.95 rows=1865539 width=635) (actual time=414307.000..414325.058 rows=2956 loops=1)
              Sort Key: (to_char(o.modified, 'YYYY-MM-DD HH24:MI:SS'::text)), o.id, o.special, o.customer, o.business_no, o.bill_to_name, o.bill_to_address1, o.bill_to_address2, o.bill_to_postal_code, o.ship_to_name, o.ship_to_address1, o.ship_to_address2, (...)
              Sort Method: external merge  Disk: 537240kB
              ->  Hash Join  (cost=11579.63..620479.38 rows=1865539 width=635) (actual time=1535.805..131378.864 rows=1872673 loops=1)
                    Hash Cond: (items.invoice_id = o.id)
                    ->  Seq Scan on invoiceitems items  (cost=0.00..78363.45 rows=1865539 width=4) (actual time=0.110..4591.117 rows=1872673 loops=1)
                          Filter: (qty < 5)
                          Rows Removed by Filter: 1405763
                    ->  Hash  (cost=5498.18..5498.18 rows=64996 width=635) (actual time=1530.786..1530.786 rows=64996 loops=1)
                          Buckets: 1024  Batches: 64  Memory Usage: 598kB
                          ->  Hash Left Join  (cost=113.02..5498.18 rows=64996 width=635) (actual time=0.214..1043.207 rows=64996 loops=1)
                                Hash Cond: (o.ship_via_id = shipvia.ship_via_id)
                                ->  Hash Left Join  (cost=75.35..4566.81 rows=64996 width=607) (actual time=0.154..754.957 rows=64996 loops=1)
                                      Hash Cond: (o.terms_id = terms.terms_id)
                                      ->  Hash Left Join  (cost=37.67..3800.33 rows=64996 width=579) (actual time=0.071..506.145 rows=64996 loops=1)
                                            Hash Cond: (o.rep_id = reps.rep_id)
                                            ->  Seq Scan on invoices o  (cost=0.00..2868.96 rows=64996 width=551) (actual time=0.010..235.977 rows=64996 loops=1)
                                            ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.044..0.044 rows=4 loops=1)
                                                  Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                                  ->  Seq Scan on reps  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.027..0.032 rows=4 loops=1)
                                      ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.067..0.067 rows=3 loops=1)
                                            Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                            ->  Seq Scan on terms  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.001..0.007 rows=3 loops=1)
                                ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.043..0.043 rows=4 loops=1)
                                      Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                      ->  Seq Scan on shipvia  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.027..0.032 rows=4 loops=1)
Total runtime: 414488.582 ms

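This is obviously terrible. I am pretty new to interpreting query plans and would like to know how to extract useful performance improvement hints from a plan like this.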

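EDIT 1

This query involves two kinds of entities, invoices and invoice items, in a 1-many relationship.
An invoice item specifies its quantity within the parent invoice.
The given query returns 100 invoices that have at least one item with a quantity less than 5.

That should explain why I need DISTINCT: an invoice may have several items satisfying the filter, but I do not want the same invoice returned multiple times. Hence the DISTINCT. However, I am perfectly aware that there may be better ways than DISTINCT to get the same semantics, and I am more than willing to learn about them.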

EDIT 2

Here are the indexes on the invoiceItems table at the time of the query:

CREATE INDEX invoiceitems_invoice_id_idx ON invoiceitems (invoice_id);
CREATE INDEX invoiceitems_invoice_id_name_index ON invoiceitems (invoice_id, name varchar_pattern_ops);
CREATE INDEX invoiceitems_name_index ON invoiceitems (name varchar_pattern_ops);
CREATE INDEX invoiceitems_qty_index ON invoiceitems (qty);

EDIT 3

The advice given by https://stackoverflow.com/users/808806/yieldsfalsehood on how to eliminate DISTINCT (and why) turned out to be very good. Here is the new query:

EXPLAIN ANALYZE
SELECT id, special, customer, business_no, bill_to_name, bill_to_address1, bill_to_address2, bill_to_postal_code, ship_to_name, ship_to_address1, ship_to_address2, ship_to_postal_code, 
purchase_order_no, ship_date::text, calc_discount_text(o) AS discount, discount_absolute, delivery, hst_percents, sub_total, total_before_hst, hst, total, total_discount, terms, rep, ship_via, 
item_count, version, to_char(modified, 'YYYY-MM-DD HH24:MI:SS') AS "modified", to_char(created, 'YYYY-MM-DD HH24:MI:SS') AS "created"
FROM invoices o
LEFT JOIN reps ON reps.rep_id = o.rep_id
LEFT JOIN terms ON terms.terms_id = o.terms_id
LEFT JOIN shipVia ON shipVia.ship_via_id = o.ship_via_id
WHERE EXISTS (SELECT 1 FROM invoiceItems items WHERE items.invoice_id = id AND items.qty < 5)
ORDER BY modified DESC
LIMIT 100

Here is the new plan:

Limit  (cost=64717.14..64717.39 rows=100 width=635) (actual time=7830.347..7830.869 rows=100 loops=1)
  ->  Sort  (cost=64717.14..64827.01 rows=43949 width=635) (actual time=7830.334..7830.568 rows=100 loops=1)
        Sort Key: (to_char(o.modified, 'YYYY-MM-DD HH24:MI:SS'::text))
        Sort Method: top-N heapsort  Memory: 76kB
        ->  Hash Left Join  (cost=113.46..63037.44 rows=43949 width=635) (actual time=2.322..6972.679 rows=64467 loops=1)
              Hash Cond: (o.ship_via_id = shipvia.ship_via_id)
              ->  Hash Left Join  (cost=75.78..50968.72 rows=43949 width=607) (actual time=0.650..3809.276 rows=64467 loops=1)
                    Hash Cond: (o.terms_id = terms.terms_id)
                    ->  Hash Left Join  (cost=38.11..50438.25 rows=43949 width=579) (actual time=0.550..3527.558 rows=64467 loops=1)
                          Hash Cond: (o.rep_id = reps.rep_id)
                          ->  Nested Loop Semi Join  (cost=0.43..49796.28 rows=43949 width=551) (actual time=0.015..3200.735 rows=64467 loops=1)
                                ->  Seq Scan on invoices o  (cost=0.00..2868.96 rows=64996 width=551) (actual time=0.002..317.954 rows=64996 loops=1)
                                ->  Index Scan using invoiceitems_invoice_id_idx on invoiceitems items  (cost=0.43..7.61 rows=42 width=4) (actual time=0.030..0.030 rows=1 loops=64996)
                                      Index Cond: (invoice_id = o.id)
                                      Filter: (qty < 5)
                                      Rows Removed by Filter: 1
                          ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.213..0.213 rows=4 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                ->  Seq Scan on reps  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.183..0.192 rows=4 loops=1)
                    ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.063..0.063 rows=3 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 1kB
                          ->  Seq Scan on terms  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.044..0.050 rows=3 loops=1)
              ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.096..0.096 rows=4 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 1kB
                    ->  Seq Scan on shipvia  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.071..0.079 rows=4 loops=1)
Total runtime: 7832.750 ms

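Is this the best I can hope for? I restarted the server (to clear the database caches) and reran the query without EXPLAIN ANALYZE. It takes almost 5 seconds. Can it be improved any further? I have 65,000 invoices and 3,278,436 invoice items.

EDIT 4

Found it. I was ordering by the computed result, modified = to_char(modified, 'YYYY-MM-DD HH24:MI:SS'), because ORDER BY modified picked up the output alias rather than the table column. Adding an index on the invoices.modified field and ordering by the field itself brings the result under 100 milliseconds!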
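The change described in this edit can be sketched roughly as follows. The index name invoices_modified_index is taken from the final plan; the column list is shortened for readability, and the alias modified_text is a hypothetical name chosen only to avoid shadowing the real column.

```sql
-- Index on the raw timestamp column (name as it appears in the final plan).
CREATE INDEX invoices_modified_index ON invoices (modified);

-- Ordering by o.modified (the table column) instead of the to_char() alias
-- lets the planner walk the index backward and stop after 100 rows.
SELECT o.id,
       to_char(o.modified, 'YYYY-MM-DD HH24:MI:SS') AS modified_text  -- hypothetical alias
FROM invoices o
WHERE EXISTS (SELECT 1
              FROM invoiceItems items
              WHERE items.invoice_id = o.id
                AND items.qty < 5)
ORDER BY o.modified DESC
LIMIT 100;
```

Qualifying the column as o.modified matters here: a bare ORDER BY modified would resolve to an output column of the same name if one exists, which is what forced the full sort on the formatted text.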

The final plan is:

Limit  (cost=1.18..1741.92 rows=100 width=635) (actual time=3.002..27.065 rows=100 loops=1)
  ->  Nested Loop Left Join  (cost=1.18..765042.09 rows=43949 width=635) (actual time=2.989..25.989 rows=100 loops=1)
        ->  Nested Loop Left Join  (cost=1.02..569900.41 rows=43949 width=607) (actual time=0.413..16.863 rows=100 loops=1)
              ->  Nested Loop Left Join  (cost=0.87..386185.48 rows=43949 width=579) (actual time=0.333..15.694 rows=100 loops=1)
                    ->  Nested Loop Semi Join  (cost=0.72..202470.54 rows=43949 width=551) (actual time=0.017..13.965 rows=100 loops=1)
                          ->  Index Scan Backward using invoices_modified_index on invoices o  (cost=0.29..155543.23 rows=64996 width=551) (actual time=0.003..4.543 rows=100 loops=1)
                          ->  Index Scan using invoiceitems_invoice_id_idx on invoiceitems items  (cost=0.43..7.61 rows=42 width=4) (actual time=0.079..0.079 rows=1 loops=100)
                                Index Cond: (invoice_id = o.id)
                                Filter: (qty < 5)
                                Rows Removed by Filter: 1
                    ->  Index Scan using reps_pkey on reps  (cost=0.15..4.17 rows=1 width=36) (actual time=0.007..0.008 rows=1 loops=100)
                          Index Cond: (rep_id = o.rep_id)
              ->  Index Scan using terms_pkey on terms  (cost=0.15..4.17 rows=1 width=36) (actual time=0.003..0.004 rows=1 loops=100)
                    Index Cond: (terms_id = o.terms_id)
        ->  Index Scan using shipvia_pkey on shipvia  (cost=0.15..4.17 rows=1 width=36) (actual time=0.006..0.008 rows=1 loops=100)
              Index Cond: (ship_via_id = o.ship_via_id)
Total runtime: 27.572 ms

Amazing! Thank you all for the help.

Answer 1

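For starters, it is pretty standard to post explain plans to http://explain.depesz.com. It adds some nice formatting, gives you a convenient way to share the plan, and lets you anonymize plans that might contain sensitive data. Even if you are not sharing the plan, it makes it much easier to understand what is happening and can sometimes show exactly where a bottleneck is.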

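There are countless resources that cover the details of interpreting postgres explain plans (see https://wiki.postgresql.org/wiki/Using_EXPLAIN). A lot of little details are taken into account when the database chooses a plan, but a few general concepts make it easier. First, get a grasp of the page-based layout of data and indexes (you don't need to know the details of the page format, just how data and indexes are split into pages). From there, get a feel for the two basic data access methods, full table scans and index scans, and with a little thought it should start to become clear in which situations one would be preferred over the other (also keep in mind that an index scan is not always even possible). At that point you can start looking into the configuration items that affect plan selection, in terms of how they might tip the scale in favor of a table scan or an index scan.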
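As a rough illustration of the configuration items mentioned above, the following session-level sketch shows a few of the standard planner settings and one way to see the alternative access path. The query is a simplified stand-in for the one in the question, not the original.

```sql
-- Cost parameters the planner weighs when choosing between
-- a sequential scan and an index scan.
SHOW seq_page_cost;
SHOW random_page_cost;
SHOW effective_cache_size;

-- Discourage sequential scans for this session only, to see what
-- the index-based alternative would look like and what it would cost.
SET enable_seqscan = off;
EXPLAIN SELECT invoice_id FROM invoiceitems WHERE qty < 5;
RESET enable_seqscan;
```

Settings like enable_seqscan are meant for this kind of experiment rather than as a permanent fix; comparing the two estimated costs shows why the planner preferred one path.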

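Once you have that down, move on up the plan and read into the details of the different nodes you find. This plan has a lot of hash joins, so read up on those to start with. Then, to compare apples to apples, disable hash joins entirely ("set enable_hashjoin = false;") and run your explain analyze again. What join method do you see now? Read up on that. Compare the estimated cost of that method with the estimated cost of the hash join. Why might they be different? The estimated cost of the second plan will be higher than that of the first (otherwise it would have been preferred in the first place), but what about the actual time it takes to run the second plan? Is it lower or higher?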
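One way to run that experiment without leaving the setting changed is to scope it to a transaction. This is a minimal sketch using a simplified version of the join from the question; substitute the real query when trying it.

```sql
BEGIN;
-- SET LOCAL reverts automatically when the transaction ends.
SET LOCAL enable_hashjoin = off;

EXPLAIN ANALYZE
SELECT o.id
FROM invoices o
JOIN invoiceItems items ON items.invoice_id = o.id
WHERE items.qty < 5;

ROLLBACK;
```

Comparing the "cost=" figures with the "actual time=" figures of both runs shows whether the planner's preference matched reality.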

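Finally, to address this plan specifically. Regarding that sort that is taking a long time: DISTINCT is not a function. "DISTINCT(id)" does not say "give me all the rows that are distinct on only the column id"; instead it sorts the rows and takes the unique values based on all columns in the output (i.e. it is equivalent to writing "DISTINCT id ..."). You should probably reconsider whether you actually need that DISTINCT at all. Normalization tends to design away the need for DISTINCT, and while it is occasionally needed, it is needed far less often than it is used.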
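The point about DISTINCT can be checked on a tiny inline data set. The table alias t and its columns are made up for this illustration; DISTINCT ON is a PostgreSQL extension and is shown only for contrast, not as the recommended fix.

```sql
-- The parentheses only group the expression "id"; DISTINCT still applies
-- to every output column. Both statements return all 3 rows.
SELECT DISTINCT(id), item
FROM (VALUES (1, 'a'), (1, 'b'), (2, 'c')) AS t(id, item);

SELECT DISTINCT id, item
FROM (VALUES (1, 'a'), (1, 'b'), (2, 'c')) AS t(id, item);

-- PostgreSQL's DISTINCT ON keeps one row per id: (1, 'a') and (2, 'c').
SELECT DISTINCT ON (id) id, item
FROM (VALUES (1, 'a'), (1, 'b'), (2, 'c')) AS t(id, item)
ORDER BY id, item;
```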

Answer 2

Chase the nodes that take the longest first and start optimizing there. In your case that seems to be

Seq Scan on invoiceitems items

You should add an index there, and probably on the other tables as well.

You can also try increasing work_mem to get rid of the external sort.
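A session-level sketch of the work_mem experiment follows. The value '64MB' is an arbitrary illustration, not a recommendation; the plan in the question spilled about 537 MB to disk, and an in-memory sort generally needs more memory than the on-disk figure, so a small increase may not be enough to avoid the external merge. The query is a shortened version of the original DISTINCT query.

```sql
SHOW work_mem;

-- Applies to the current session only.
SET work_mem = '64MB';

EXPLAIN ANALYZE
SELECT DISTINCT o.id, o.customer, o.modified
FROM invoices o
JOIN invoiceItems items ON items.invoice_id = o.id
WHERE items.qty < 5
ORDER BY o.modified
LIMIT 100;

RESET work_mem;
```

The "Sort Method" line in the output shows whether the sort still went to disk ("external merge") or fit in memory.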

Once you have done that, the new plan will probably look completely different, so start over from there.

How to interpret the following PostgreSQL query plan