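How to interpret the following PostgreSQL query plan

Asked: 2013-12-06 12:34:41

Tags: postgresql sql-execution-plan

Please observe:

(I forgot to add the ORDER BY; the plan has been updated.)

The query: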

EXPLAIN ANALYZE
SELECT DISTINCT(id), special, customer, business_no, bill_to_name, bill_to_address1, bill_to_address2, bill_to_postal_code, ship_to_name, ship_to_address1, ship_to_address2, ship_to_postal_code, 
purchase_order_no, ship_date::text, calc_discount_text(o) AS discount, discount_absolute, delivery, hst_percents, sub_total, total_before_hst, hst, total, total_discount, terms, rep, ship_via, 
item_count, version, to_char(modified, 'YYYY-MM-DD HH24:MI:SS') AS "modified", to_char(created, 'YYYY-MM-DD HH24:MI:SS') AS "created"
FROM invoices o
LEFT JOIN reps ON reps.rep_id = o.rep_id
LEFT JOIN terms ON terms.terms_id = o.terms_id
LEFT JOIN shipVia ON shipVia.ship_via_id = o.ship_via_id
JOIN invoiceItems items ON items.invoice_id = o.id 
WHERE items.qty < 5
ORDER BY modified
LIMIT 100

The result:

Limit  (cost=2931740.10..2931747.85 rows=100 width=635) (actual time=414307.004..414387.899 rows=100 loops=1)
  ->  Unique  (cost=2931740.10..3076319.37 rows=1865539 width=635) (actual time=414307.001..414387.690 rows=100 loops=1)
        ->  Sort  (cost=2931740.10..2936403.95 rows=1865539 width=635) (actual time=414307.000..414325.058 rows=2956 loops=1)
              Sort Key: (to_char(o.modified, 'YYYY-MM-DD HH24:MI:SS'::text)), o.id, o.special, o.customer, o.business_no, o.bill_to_name, o.bill_to_address1, o.bill_to_address2, o.bill_to_postal_code, o.ship_to_name, o.ship_to_address1, o.ship_to_address2, (...)
              Sort Method: external merge  Disk: 537240kB
              ->  Hash Join  (cost=11579.63..620479.38 rows=1865539 width=635) (actual time=1535.805..131378.864 rows=1872673 loops=1)
                    Hash Cond: (items.invoice_id = o.id)
                    ->  Seq Scan on invoiceitems items  (cost=0.00..78363.45 rows=1865539 width=4) (actual time=0.110..4591.117 rows=1872673 loops=1)
                          Filter: (qty < 5)
                          Rows Removed by Filter: 1405763
                    ->  Hash  (cost=5498.18..5498.18 rows=64996 width=635) (actual time=1530.786..1530.786 rows=64996 loops=1)
                          Buckets: 1024  Batches: 64  Memory Usage: 598kB
                          ->  Hash Left Join  (cost=113.02..5498.18 rows=64996 width=635) (actual time=0.214..1043.207 rows=64996 loops=1)
                                Hash Cond: (o.ship_via_id = shipvia.ship_via_id)
                                ->  Hash Left Join  (cost=75.35..4566.81 rows=64996 width=607) (actual time=0.154..754.957 rows=64996 loops=1)
                                      Hash Cond: (o.terms_id = terms.terms_id)
                                      ->  Hash Left Join  (cost=37.67..3800.33 rows=64996 width=579) (actual time=0.071..506.145 rows=64996 loops=1)
                                            Hash Cond: (o.rep_id = reps.rep_id)
                                            ->  Seq Scan on invoices o  (cost=0.00..2868.96 rows=64996 width=551) (actual time=0.010..235.977 rows=64996 loops=1)
                                            ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.044..0.044 rows=4 loops=1)
                                                  Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                                  ->  Seq Scan on reps  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.027..0.032 rows=4 loops=1)
                                      ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.067..0.067 rows=3 loops=1)
                                            Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                            ->  Seq Scan on terms  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.001..0.007 rows=3 loops=1)
                                ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.043..0.043 rows=4 loops=1)
                                      Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                      ->  Seq Scan on shipvia  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.027..0.032 rows=4 loops=1)
Total runtime: 414488.582 ms

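This is obviously terrible. I am new to interpreting query plans and would like to know how to extract useful performance-improvement hints from a plan like this one.

EDIT 1

  • The query involves two kinds of entities, invoices and invoice items, in a one-to-many relationship.
  • An invoice item specifies its quantity within the parent invoice.
  • The given query returns 100 invoices that have at least one item with a quantity less than 5.

That should explain why I need DISTINCT: an invoice may have several items that satisfy the filter, but I do not want the same invoice returned more than once. Hence the DISTINCT. However, I am perfectly aware that there may be better ways to get the same semantics than using DISTINCT, and I am more than willing to learn about them.

EDIT 2

Here are the indexes on the invoiceItems table at the time of the query: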

CREATE INDEX invoiceitems_invoice_id_idx ON invoiceitems (invoice_id);
CREATE INDEX invoiceitems_invoice_id_name_index ON invoiceitems (invoice_id, name varchar_pattern_ops);
CREATE INDEX invoiceitems_name_index ON invoiceitems (name varchar_pattern_ops);
CREATE INDEX invoiceitems_qty_index ON invoiceitems (qty);

EDIT 3

The advice given by https://stackoverflow.com/users/808806/yieldsfalsehood on how to eliminate the DISTINCT (and why) turned out to be really good. Here is the new query:

EXPLAIN ANALYZE
SELECT id, special, customer, business_no, bill_to_name, bill_to_address1, bill_to_address2, bill_to_postal_code, ship_to_name, ship_to_address1, ship_to_address2, ship_to_postal_code, 
purchase_order_no, ship_date::text, calc_discount_text(o) AS discount, discount_absolute, delivery, hst_percents, sub_total, total_before_hst, hst, total, total_discount, terms, rep, ship_via, 
item_count, version, to_char(modified, 'YYYY-MM-DD HH24:MI:SS') AS "modified", to_char(created, 'YYYY-MM-DD HH24:MI:SS') AS "created"
FROM invoices o
LEFT JOIN reps ON reps.rep_id = o.rep_id
LEFT JOIN terms ON terms.terms_id = o.terms_id
LEFT JOIN shipVia ON shipVia.ship_via_id = o.ship_via_id
WHERE EXISTS (SELECT 1 FROM invoiceItems items WHERE items.invoice_id = id AND items.qty < 5)
ORDER BY modified DESC
LIMIT 100

And here is the new plan:

Limit  (cost=64717.14..64717.39 rows=100 width=635) (actual time=7830.347..7830.869 rows=100 loops=1)
  ->  Sort  (cost=64717.14..64827.01 rows=43949 width=635) (actual time=7830.334..7830.568 rows=100 loops=1)
        Sort Key: (to_char(o.modified, 'YYYY-MM-DD HH24:MI:SS'::text))
        Sort Method: top-N heapsort  Memory: 76kB
        ->  Hash Left Join  (cost=113.46..63037.44 rows=43949 width=635) (actual time=2.322..6972.679 rows=64467 loops=1)
              Hash Cond: (o.ship_via_id = shipvia.ship_via_id)
              ->  Hash Left Join  (cost=75.78..50968.72 rows=43949 width=607) (actual time=0.650..3809.276 rows=64467 loops=1)
                    Hash Cond: (o.terms_id = terms.terms_id)
                    ->  Hash Left Join  (cost=38.11..50438.25 rows=43949 width=579) (actual time=0.550..3527.558 rows=64467 loops=1)
                          Hash Cond: (o.rep_id = reps.rep_id)
                          ->  Nested Loop Semi Join  (cost=0.43..49796.28 rows=43949 width=551) (actual time=0.015..3200.735 rows=64467 loops=1)
                                ->  Seq Scan on invoices o  (cost=0.00..2868.96 rows=64996 width=551) (actual time=0.002..317.954 rows=64996 loops=1)
                                ->  Index Scan using invoiceitems_invoice_id_idx on invoiceitems items  (cost=0.43..7.61 rows=42 width=4) (actual time=0.030..0.030 rows=1 loops=64996)
                                      Index Cond: (invoice_id = o.id)
                                      Filter: (qty < 5)
                                      Rows Removed by Filter: 1
                          ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.213..0.213 rows=4 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                ->  Seq Scan on reps  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.183..0.192 rows=4 loops=1)
                    ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.063..0.063 rows=3 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 1kB
                          ->  Seq Scan on terms  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.044..0.050 rows=3 loops=1)
              ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.096..0.096 rows=4 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 1kB
                    ->  Seq Scan on shipvia  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.071..0.079 rows=4 loops=1)
Total runtime: 7832.750 ms

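Is this the best I can hope for? I restarted the server (to clear the database caches) and reran the query without EXPLAIN ANALYZE. It takes almost 5 seconds. Can it be improved further? I have 65,000 invoices and 3,278,436 invoice items.

EDIT 4

Found it. I was ordering by a computed result, modified = to_char(modified, 'YYYY-MM-DD HH24:MI:SS'). Adding an index on the modified field of invoices and ordering by the field itself brings the result under 100 ms!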
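The final query itself is not shown, so the following is only a sketch of what the change might look like. The index name invoices_modified_index is taken from the final plan; the exact CREATE INDEX statement and the shortened select list are assumptions for illustration.

```sql
-- Assumed DDL: only the index name is confirmed by the final plan.
CREATE INDEX invoices_modified_index ON invoices (modified);

-- Abbreviated select list; the LEFT JOINs from EDIT 3 are omitted for brevity.
SELECT o.id,
       o.customer,
       to_char(o.modified, 'YYYY-MM-DD HH24:MI:SS') AS "modified"
FROM invoices o
WHERE EXISTS (SELECT 1
              FROM invoiceItems items
              WHERE items.invoice_id = o.id
                AND items.qty < 5)
-- o.modified is the raw timestamp column. A bare "modified" would resolve
-- to the to_char() output alias, which the index cannot serve.
ORDER BY o.modified DESC
LIMIT 100;
```

Qualifying the column as o.modified matters because PostgreSQL resolves a bare name in ORDER BY against output column names first.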

The final plan is:

Limit  (cost=1.18..1741.92 rows=100 width=635) (actual time=3.002..27.065 rows=100 loops=1)
  ->  Nested Loop Left Join  (cost=1.18..765042.09 rows=43949 width=635) (actual time=2.989..25.989 rows=100 loops=1)
        ->  Nested Loop Left Join  (cost=1.02..569900.41 rows=43949 width=607) (actual time=0.413..16.863 rows=100 loops=1)
              ->  Nested Loop Left Join  (cost=0.87..386185.48 rows=43949 width=579) (actual time=0.333..15.694 rows=100 loops=1)
                    ->  Nested Loop Semi Join  (cost=0.72..202470.54 rows=43949 width=551) (actual time=0.017..13.965 rows=100 loops=1)
                          ->  Index Scan Backward using invoices_modified_index on invoices o  (cost=0.29..155543.23 rows=64996 width=551) (actual time=0.003..4.543 rows=100 loops=1)
                          ->  Index Scan using invoiceitems_invoice_id_idx on invoiceitems items  (cost=0.43..7.61 rows=42 width=4) (actual time=0.079..0.079 rows=1 loops=100)
                                Index Cond: (invoice_id = o.id)
                                Filter: (qty < 5)
                                Rows Removed by Filter: 1
                    ->  Index Scan using reps_pkey on reps  (cost=0.15..4.17 rows=1 width=36) (actual time=0.007..0.008 rows=1 loops=100)
                          Index Cond: (rep_id = o.rep_id)
              ->  Index Scan using terms_pkey on terms  (cost=0.15..4.17 rows=1 width=36) (actual time=0.003..0.004 rows=1 loops=100)
                    Index Cond: (terms_id = o.terms_id)
        ->  Index Scan using shipvia_pkey on shipvia  (cost=0.15..4.17 rows=1 width=36) (actual time=0.006..0.008 rows=1 loops=100)
              Index Cond: (ship_via_id = o.ship_via_id)
Total runtime: 27.572 ms

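Amazing! Thank you all for the help.

2 Answers:

Answer 0: (score: 4)

For starters, it is pretty standard to post explain plans to http://explain.depesz.com: it adds some nice formatting, gives you a convenient way to share the plan, and lets you anonymize plans that might contain sensitive data. Even if you are not sharing the plan, it makes it much easier to understand what is going on, and it can sometimes show exactly where the bottleneck is.

There are countless resources that explain the details of PostgreSQL explain plans (see https://wiki.postgresql.org/wiki/Using_EXPLAIN). The database weighs many details when choosing a plan, but a few general concepts make it easier to follow. First, get a grasp of the page-based layout of data and indexes (you do not need to know the details of the page format, just how data and indexes are split into pages). From there, get a feel for the two basic data access methods, full table scans and index scans. With a little thought it should start to become clear in which situations one would be preferred over the other (bearing in mind that an index scan is not even always possible). At that point you can start looking into the configuration items that affect plan selection, in terms of how they tip the cost in favor of a table scan or an index scan.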

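Once you have that down, move on to the plan itself and read up on the details of the different nodes you find. This plan has a lot of hash joins, so read up on those first. Then, to compare apples to apples, disable hash joins entirely ("set enable_hashjoin = false;") and run the EXPLAIN ANALYZE again. What join method do you see now? Read up on it. Compare the estimated cost of that method with the estimated cost of the hash join. Why might they differ? The estimated cost of the second plan will be higher than that of the first (otherwise it would have been preferred in the first place), but what about the actual time it takes to run the second plan? Is it lower or higher?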
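As a rough sketch of that experiment, the setting can be confined to a single transaction so that it does not leak into the rest of the session. The query below is an abbreviated version of the one in the question, shortened for illustration only.

```sql
BEGIN;

-- SET LOCAL lasts only until the end of this transaction.
SET LOCAL enable_hashjoin = off;

-- Note: EXPLAIN ANALYZE really executes the query, so this can take a while.
EXPLAIN ANALYZE
SELECT DISTINCT o.id, o.customer
FROM invoices o
JOIN invoiceItems items ON items.invoice_id = o.id
WHERE items.qty < 5;

ROLLBACK;
```

Running the same EXPLAIN ANALYZE without the SET LOCAL line gives the plan to compare against.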

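Finally, to address this plan specifically. About that sort that is taking so long: DISTINCT is not a function. "DISTINCT(id)" does not say "give me all the rows that are distinct on the id column only"; instead, the rows are sorted and unique values are taken based on all of the columns in the output (i.e. it is equivalent to writing "DISTINCT id, ..."). You should probably reconsider whether you really need that DISTINCT. Normalization tends to design away the need for DISTINCT, and while it is occasionally needed, it is not always true that it is really necessary.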
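To make the point about the parentheses concrete, here is a small sketch using two columns from the question's invoices table. The DISTINCT ON variant is a PostgreSQL-specific alternative shown only for contrast; it is not something the question used.

```sql
-- These two statements are equivalent: the parentheses merely wrap the
-- expression id, and DISTINCT still applies to the whole select list.
SELECT DISTINCT (id), customer FROM invoices;
SELECT DISTINCT id, customer FROM invoices;

-- PostgreSQL's DISTINCT ON keeps one row per id. The ORDER BY must start
-- with the DISTINCT ON expressions.
SELECT DISTINCT ON (o.id) o.id, o.customer
FROM invoices o
JOIN invoiceItems items ON items.invoice_id = o.id
WHERE items.qty < 5
ORDER BY o.id;
```

Both forms still have to process every joined row before deduplicating, which is why avoiding the duplicates in the first place tends to be cheaper.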

Answer 1: (score: 0)

Go after the node that takes the longest first, and start optimizing there. In your case that seems to be

Seq Scan on invoiceitems items

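You should add an index there, and probably on the other tables as well.

You can also try increasing work_mem to get rid of the external sort.

Once you have done that, the new plan may look completely different, so start over.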
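A minimal sketch of the work_mem experiment, for the current session only. The value 256MB is an arbitrary illustration, not a recommendation, and the query is an abbreviated version of the original one, so its sort will be smaller than the 537240kB spill reported in the plan.

```sql
-- Check the current setting first.
SHOW work_mem;

-- Session-level change only; the value is an assumption for illustration.
SET work_mem = '256MB';

-- Look at the "Sort Method" line of the output to see whether the sort
-- still spills to disk (external merge) or now fits in memory.
EXPLAIN ANALYZE
SELECT DISTINCT o.id, o.customer, o.modified
FROM invoices o
JOIN invoiceItems items ON items.invoice_id = o.id
WHERE items.qty < 5
ORDER BY o.modified
LIMIT 100;

-- Restore the configured default for this session.
RESET work_mem;
```

Keep in mind that work_mem applies per sort or hash operation, so a large value can multiply across concurrent queries.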