Postgres multi-join query optimization with conditions on multiple tables

Date: 2016-08-02 20:51:10

Tags: sql database postgresql join query-optimization

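Summary

Adding certain conditions to a multi-table join over tables with many rows slows the query down by orders of magnitude. I have tried many things to speed it up, including every type of table join, reordering the joins, reordering the WHERE clauses, using subqueries, using CASE statements in the WHERE clause, and so on.

The SQL details are below.

Questions

  1. Why does adding such a simple condition cause the planner to change its execution plan so drastically?
  2. Is there a way to tell the planner to evaluate a particular condition first without drastically changing the query or resorting to subqueries (e.g. using WITH
  3. Note: I am trying to write a generic SQL builder for an API that lets callers specify arbitrary conditions anywhere in the graph. The problem is that some of these calls are very fast while others are not, because of the way Postgres plans their execution. Optimizations hand-crafted for this particular query will not help me with the larger goal of a generic SQL builder.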

    Details

    I have a schema in Postgres that stores vertices and edges (a simple graph database):

    CREATE TABLE IF NOT EXISTS vertex (type text, id serial, name text, data jsonb, UNIQUE (id));
    CREATE INDEX vertex_data_idx ON vertex USING gin (data jsonb_path_ops);
    CREATE INDEX vertex_type_idx ON vertex (type);
    CREATE INDEX vertex_name_idx ON vertex (name);
    CREATE TABLE IF NOT EXISTS edge (src integer REFERENCES vertex (id), dst integer REFERENCES vertex (id));
    CREATE INDEX edge_src_idx ON edge (src);
    CREATE INDEX edge_dst_idx ON edge (dst);
    

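    The schema stores graphs, one of which looks like this: PLANET -> CONTINENT -> COUNTRY -> REGION

    My sample database has 447554 vertices and 3155047 edges in total, but the relevant data is this:

    • 5 PLANETs (each related to 5 CONTINENTs)
    • 25 CONTINENTs (each related to 2500 COUNTRYs)
    • 62500 COUNTRYs (25% of which are related to 100 REGIONs each; the rest have no REGION relationships)
    • 250000 REGIONs

    This query, which finds the planets that have Spanish speakers in any given region, is fast: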

    SELECT DISTINCT 
        v1.name as name, v1.id as id 
    FROM vertex v1 
      LEFT JOIN edge e1 ON (v1.id = e1.src) 
      LEFT JOIN vertex v2 ON (v2.id = e1.dst) 
      LEFT JOIN edge e2 ON (v2.id = e2.src) 
      LEFT JOIN vertex v3 ON (v3.id = e2.dst) 
      LEFT JOIN edge e3 ON (v3.id = e3.src) 
      LEFT JOIN vertex v4 ON (v4.id = e3.dst)
    WHERE
      v4.type = 'REGION' AND 
      v4.data @> '{"languages":["spanish"]}'::jsonb
    

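    Planning time: 6.289 ms  Execution time: 0.744 ms

    When I add a condition on an indexed column of the first table in the graph (v1), a condition that has no effect on the result, the query becomes 12,657 times slower: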

    SELECT DISTINCT 
        v1.name as name, v1.id as id 
    FROM vertex v1 
      LEFT JOIN edge e1 ON (v1.id = e1.src) 
      LEFT JOIN vertex v2 ON (v2.id = e1.dst) 
      LEFT JOIN edge e2 ON (v2.id = e2.src) 
      LEFT JOIN vertex v3 ON (v3.id = e2.dst) 
      LEFT JOIN edge e3 ON (v3.id = e3.src) 
      LEFT JOIN vertex v4 ON (v4.id = e3.dst)
    WHERE
      v1.type = 'PLANET' AND 
      v4.type = 'REGION' AND 
      v4.data @> '{"languages":["spanish"]}'::jsonb
    

    Planning time: 7.664 ms  Execution time: 89010.096 ms

    Here is the EXPLAIN (ANALYZE, BUFFERS) output for the first, fast query:

     Unique  (cost=154592.03..155453.96 rows=114925 width=28) (actual time=0.585..0.616 rows=4 loops=1)
       Buffers: shared hit=92
       ->  Sort  (cost=154592.03..154879.34 rows=114925 width=28) (actual time=0.579..0.588 rows=4 loops=1)
             Sort Key: v1.name, v1.id
             Sort Method: quicksort  Memory: 17kB
             Buffers: shared hit=92
             ->  Nested Loop  (cost=37.96..142377.39 rows=114925 width=28) (actual time=0.155..0.549 rows=4 loops=1)
                   Buffers: shared hit=92
                   ->  Nested Loop  (cost=37.53..80131.76 rows=114925 width=4) (actual time=0.141..0.468 rows=4 loops=1)
                         Join Filter: (v2.id = e1.dst)
                         Buffers: shared hit=76
                         ->  Nested Loop  (cost=37.10..49179.08 rows=14270 width=8) (actual time=0.126..0.386 rows=4 loops=1)
                               Buffers: shared hit=60
                               ->  Nested Loop  (cost=36.68..41450.17 rows=14270 width=4) (actual time=0.112..0.304 rows=4 loops=1)
                                     Join Filter: (v3.id = e2.dst)
                                     Buffers: shared hit=44
                                     ->  Nested Loop  (cost=36.25..37606.57 rows=1772 width=8) (actual time=0.092..0.209 rows=4 loops=1)
                                           Buffers: shared hit=28
                                           ->  Nested Loop  (cost=35.83..36646.82 rows=1772 width=4) (actual time=0.074..0.116 rows=4 loops=1)
                                                 Buffers: shared hit=12
                                                 ->  Bitmap Heap Scan on vertex v4  (cost=30.99..1514.00 rows=220 width=4) (actual time=0.039..0.042 rows=1 loops=1)
                                                       Recheck Cond: (data @> '{"languages":["spanish"]}'::jsonb)
                                                       Filter: (type = 'REGION'::text)
                                                       Heap Blocks: exact=1
                                                       Buffers: shared hit=5
                                                       ->  Bitmap Index Scan on vertex_data_idx  (cost=0.00..30.94 rows=392 width=0) (actual time=0.020..0.020 rows=1 loops=1)
                                                             Index Cond: (data @> '{"languages":["spanish"]}'::jsonb)
                                                             Buffers: shared hit=4
                                                 ->  Bitmap Heap Scan on edge e3  (cost=4.84..159.12 rows=57 width=8) (actual time=0.023..0.037 rows=4 loops=1)
                                                       Recheck Cond: (dst = v4.id)
                                                       Heap Blocks: exact=4
                                                       Buffers: shared hit=7
                                                       ->  Bitmap Index Scan on edge_dst_idx  (cost=0.00..4.82 rows=57 width=0) (actual time=0.013..0.013 rows=4 loops=1)
                                                             Index Cond: (dst = v4.id)
                                                             Buffers: shared hit=3
                                           ->  Index Only Scan using vertex_id_key on vertex v3  (cost=0.42..0.53 rows=1 width=4) (actual time=0.008..0.011 rows=1 loops=4)
                                                 Index Cond: (id = e3.src)
                                                 Heap Fetches: 4
                                                 Buffers: shared hit=16
                                     ->  Index Scan using edge_dst_idx on edge e2  (cost=0.43..1.46 rows=57 width=8) (actual time=0.008..0.011 rows=1 loops=4)
                                           Index Cond: (dst = e3.src)
                                           Buffers: shared hit=16
                               ->  Index Only Scan using vertex_id_key on vertex v2  (cost=0.42..0.53 rows=1 width=4) (actual time=0.006..0.009 rows=1 loops=4)
                                     Index Cond: (id = e2.src)
                                     Heap Fetches: 4
                                     Buffers: shared hit=16
                         ->  Index Scan using edge_dst_idx on edge e1  (cost=0.43..1.46 rows=57 width=8) (actual time=0.005..0.008 rows=1 loops=4)
                               Index Cond: (dst = e2.src)
                               Buffers: shared hit=16
                   ->  Index Scan using vertex_id_key on vertex v1  (cost=0.42..0.53 rows=1 width=28) (actual time=0.006..0.009 rows=1 loops=4)
                         Index Cond: (id = e1.src)
                         Buffers: shared hit=16
     Planning time: 6.940 ms
     Execution time: 0.714 ms
    

    And for the second, slow query:

     HashAggregate  (cost=592.23..592.24 rows=1 width=28) (actual time=89009.873..89009.885 rows=4 loops=1)
       Group Key: v1.name, v1.id
       Buffers: shared hit=11644657 read=1240045
       ->  Nested Loop  (cost=2.98..592.22 rows=1 width=28) (actual time=9098.961..89009.833 rows=4 loops=1)
             Buffers: shared hit=11644657 read=1240045
             ->  Nested Loop  (cost=2.56..306.89 rows=522 width=32) (actual time=0.424..30066.007 rows=3092522 loops=1)
                   Buffers: shared hit=454795 read=46267
                   ->  Nested Loop  (cost=2.13..86.31 rows=65 width=36) (actual time=0.306..2120.293 rows=62500 loops=1)
                         Buffers: shared hit=239162 read=12162
                         ->  Nested Loop  (cost=1.70..51.10 rows=65 width=32) (actual time=0.261..574.490 rows=62500 loops=1)
                               Buffers: shared hit=488 read=562
                               ->  Nested Loop  (cost=1.27..23.95 rows=8 width=36) (actual time=0.205..1.206 rows=25 loops=1)
                                     Buffers: shared hit=109 read=17
                                     ->  Nested Loop  (cost=0.85..19.62 rows=8 width=32) (actual time=0.173..0.547 rows=25 loops=1)
                                           Buffers: shared hit=12 read=14
                                           ->  Index Scan using vertex_type_idx on vertex v1  (cost=0.42..8.44 rows=1 width=28) (actual time=0.123..0.153 rows=5 loops=1)
                                                 Index Cond: (type = 'PLANET'::text)
                                                 Buffers: shared hit=2 read=4
                                           ->  Index Scan using edge_src_idx on edge e1  (cost=0.43..10.18 rows=100 width=8) (actual time=0.021..0.039 rows=5 loops=5)
                                                 Index Cond: (src = v1.id)
                                                 Buffers: shared hit=10 read=10
                                     ->  Index Only Scan using vertex_id_key on vertex v2  (cost=0.42..0.53 rows=1 width=4) (actual time=0.009..0.013 rows=1 loops=25)
                                           Index Cond: (id = e1.dst)
                                           Heap Fetches: 25
                                           Buffers: shared hit=97 read=3
                               ->  Index Scan using edge_src_idx on edge e2  (cost=0.43..2.39 rows=100 width=8) (actual time=0.031..8.504 rows=2500 loops=25)
                                     Index Cond: (src = v2.id)
                                     Buffers: shared hit=379 read=545
                         ->  Index Only Scan using vertex_id_key on vertex v3  (cost=0.42..0.53 rows=1 width=4) (actual time=0.010..0.013 rows=1 loops=62500)
                               Index Cond: (id = e2.dst)
                               Heap Fetches: 62500
                               Buffers: shared hit=238674 read=11600
                   ->  Index Scan using edge_src_idx on edge e3  (cost=0.43..2.39 rows=100 width=8) (actual time=0.013..0.163 rows=49 loops=62500)
                         Index Cond: (src = v3.id)
                         Buffers: shared hit=215633 read=34105
             ->  Index Scan using vertex_id_key on vertex v4  (cost=0.42..0.54 rows=1 width=4) (actual time=0.013..0.013 rows=0 loops=3092522)
                   Index Cond: (id = e3.dst)
                   Filter: ((data @> '{"languages":["spanish"]}'::jsonb) AND (type = 'REGION'::text))
                   Rows Removed by Filter: 1
                   Buffers: shared hit=11189862 read=1193778
     Planning time: 7.664 ms
     Execution time: 89010.096 ms
    

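2 Answers:

Answer 0 (score: 1)

[Posted as an answer because I need the formatting]

The edge table definitely needs a primary key (this implies NOT NULL for {src, dst}, which is a good thing):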

CREATE TABLE IF NOT EXISTS edge
    ( src integer NOT NULL REFERENCES vertex (id)
    , dst integer NOT NULL REFERENCES vertex (id)
    , PRIMARY KEY (src,dst)
    );
CREATE UNIQUE INDEX edge_dst_src_idx ON edge (dst, src);

-- the estimates in the question seem to be off, statistics may be absent.
VACUUM ANALYZE edge; -- refresh the statistics
VACUUM ANALYZE vertex;
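
The remark about the estimates can be checked against the slow plan: it expects 1 row for type = 'PLANET' (5 actual) and 100 rows per edge_src_idx lookup on e2 (2500 actual). The following is only a sketch of how one might inspect and refine the statistics. It assumes the default statistics target is too coarse for these columns, and the target value 1000 is an illustrative choice, not something taken from the question.

```sql
-- Inspect what the planner currently believes about the join and filter columns.
SELECT tablename, attname, null_frac, n_distinct, most_common_vals
FROM pg_stats
WHERE (tablename = 'edge'   AND attname IN ('src', 'dst'))
   OR (tablename = 'vertex' AND attname = 'type');

-- Assumption: a larger sample improves estimates for these skewed columns.
ALTER TABLE edge   ALTER COLUMN src  SET STATISTICS 1000;
ALTER TABLE edge   ALTER COLUMN dst  SET STATISTICS 1000;
ALTER TABLE vertex ALTER COLUMN type SET STATISTICS 1000;

-- The new target only takes effect at the next ANALYZE.
ANALYZE edge;
ANALYZE vertex;
```

Re-running EXPLAIN (ANALYZE, BUFFERS) afterwards shows whether the estimated and actual row counts have moved closer together.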

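I would also combine {type, name} into a single index (type appears to have very low cardinality). Maybe even make it UNIQUE and NOT NULL, but I don't know your data.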

CREATE INDEX vertex_type_name_idx ON vertex (type, name);
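
Whether the index can be made UNIQUE and the columns NOT NULL depends on the data. A minimal sketch of how one might check first. The index name vertex_type_name_key is hypothetical, and the final statements assume both checks come back clean.

```sql
-- Any row returned here means (type, name) is not unique.
SELECT type, name, count(*) AS copies
FROM vertex
GROUP BY type, name
HAVING count(*) > 1;

-- NULLs in either column would block SET NOT NULL.
SELECT count(*) FILTER (WHERE type IS NULL) AS null_types,
       count(*) FILTER (WHERE name IS NULL) AS null_names
FROM vertex;

-- Only if the first query returns no rows and both counts are 0:
ALTER TABLE vertex
    ALTER COLUMN type SET NOT NULL,
    ALTER COLUMN name SET NOT NULL;
-- Hypothetical name; this would take the place of vertex_type_name_idx.
CREATE UNIQUE INDEX vertex_type_name_key ON vertex (type, name);
```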

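Answer 1 (score: 0)

I think using a subquery will prevent PostgreSQL from using the index. So try the following query to test whether performance improves when the index is not used: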

select * from (
SELECT DISTINCT 
    v1.name as name, v1.id as id, v1.type as v1_type
FROM vertex v1 
  LEFT JOIN edge e1 ON (v1.id = e1.src) 
  LEFT JOIN vertex v2 ON (v2.id = e1.dst) 
  LEFT JOIN edge e2 ON (v2.id = e2.src) 
  LEFT JOIN vertex v3 ON (v3.id = e2.dst) 
  LEFT JOIN edge e3 ON (v3.id = e3.src) 
  LEFT JOIN vertex v4 ON (v4.id = e3.dst)
WHERE
  v4.type = 'REGION' AND 
  v4.data @> '{"languages":["spanish"]}'::jsonb
) t1 
where v1_type = 'PLANET'
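
One caveat: because v1_type is part of the DISTINCT list, PostgreSQL may push the outer condition down into a subquery like this one, in which case the plan would not change. A CTE is one way to keep the two steps separate. Before PostgreSQL 12 every WITH query is planned on its own, and from version 12 onward the MATERIALIZED keyword requests the same behavior. The following is a sketch under those assumptions, not a tested fix for this data set; the CTE name spanish_planets is hypothetical.

```sql
-- MATERIALIZED requires PostgreSQL 12+. On older versions drop the keyword,
-- since every CTE is materialized there anyway.
WITH spanish_planets AS MATERIALIZED (
    SELECT DISTINCT
        v1.name AS name, v1.id AS id, v1.type AS v1_type
    FROM vertex v1
      LEFT JOIN edge e1 ON (v1.id = e1.src)
      LEFT JOIN vertex v2 ON (v2.id = e1.dst)
      LEFT JOIN edge e2 ON (v2.id = e2.src)
      LEFT JOIN vertex v3 ON (v3.id = e2.dst)
      LEFT JOIN edge e3 ON (v3.id = e3.src)
      LEFT JOIN vertex v4 ON (v4.id = e3.dst)
    WHERE
      v4.type = 'REGION' AND
      v4.data @> '{"languages":["spanish"]}'::jsonb
)
-- The type filter is applied to the small CTE result only.
SELECT name, id
FROM spanish_planets
WHERE v1_type = 'PLANET';
```

The inner query is the fast query from the question with one extra output column, so the planner has no v1.type condition available when it chooses the join order.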