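I have two tables:
I use a stored procedure to process events and convert them into dir_current rows. The first step in processing events is to find all rows whose parent is missing from the dir_current table. However, a parent may still be present in the event table, and such rows should not be included in the result. I came up with this query: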
SELECT DISTINCT event.parent_path, event.depth FROM sf.event as event
LEFT OUTER JOIN sf.dir_current as dir ON
event.parent_path = dir.path
AND dir.volume_id = 1
LEFT OUTER JOIN sf.event as event2 ON
event.parent_path = event2.path
AND event2.volume_id = 1
AND event2.type = 'DIR'
AND event2.id <= MAX_ID_VARIABLE
WHERE
event.volume_id = 1
AND event.id <= MAX_ID_VARIABLE
AND dir.volume_id IS NULL
AND event2.id IS NULL
ORDER BY depth, parent_path;
MAX_ID_VARIABLE is a variable that limits the number of events processed at a time.
Here is the EXPLAIN ANALYZE output (explain.depesz.com):
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=395165.81..395165.82 rows=1 width=83) (actual time=32009.439..32049.675 rows=2462 loops=1)
-> Sort (cost=395165.81..395165.81 rows=1 width=83) (actual time=32009.432..32021.733 rows=184975 loops=1)
Sort Key: event.depth, event.parent_path
Sort Method: quicksort Memory: 38705kB
-> Nested Loop Anti Join (cost=133385.93..395165.80 rows=1 width=83) (actual time=235.581..30916.912 rows=184975 loops=1)
-> Hash Anti Join (cost=133385.38..395165.14 rows=1 width=83) (actual time=83.073..1703.618 rows=768278 loops=1)
Hash Cond: (event.parent_path = event2.path)
-> Seq Scan on event (cost=0.00..252872.92 rows=2375157 width=83) (actual time=0.014..756.014 rows=2000000 loops=1)
Filter: ((id <= 13000000) AND (volume_id = 1))
-> Hash (cost=132700.54..132700.54 rows=54787 width=103) (actual time=82.754..82.754 rows=48029 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 6696kB
-> Bitmap Heap Scan on event event2 (cost=6196.07..132700.54 rows=54787 width=103) (actual time=12.979..63.803 rows=48029 loops=1)
Recheck Cond: (type = '16384'::text)
Filter: ((id <= 13000000) AND (volume_id = 1))
Heap Blocks: exact=16465
-> Bitmap Index Scan on event_dir_depth_idx (cost=0.00..6182.38 rows=54792 width=0) (actual time=8.759..8.759 rows=48029 loops=1)
-> Index Only Scan using dircurrent_volumeid_path_unq on dir_current dir (cost=0.55..0.65 rows=1 width=115) (actual time=0.038..0.038 rows=1 loops=768278)
Index Cond: ((volume_id = 1) AND (path = event.parent_path))
Heap Fetches: 583027
Planning time: 2.114 ms
Execution time: 32054.498 ms
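The slowest part is the Index Only Scan on the dir_current table (29 of the 32 seconds total).
I am wondering why Postgres uses an index scan here instead of a sequential scan, which would take 2-3 seconds.
After setting: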
SET enable_indexscan TO false;
SET enable_bitmapscan TO false;
the query runs in about 4 seconds (explain.depesz.com):
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=569654.93..569654.94 rows=1 width=83) (actual time=3943.487..3979.613 rows=2462 loops=1)
-> Sort (cost=569654.93..569654.93 rows=1 width=83) (actual time=3943.481..3954.169 rows=184975 loops=1)
Sort Key: event.depth, event.parent_path
Sort Method: quicksort Memory: 38705kB
-> Hash Anti Join (cost=307875.14..569654.92 rows=1 width=83) (actual time=1393.185..2970.626 rows=184975 loops=1)
Hash Cond: ((event.parent_path = dir.path) AND ((event.depth - 1) = dir.depth))
-> Hash Anti Join (cost=259496.25..521276.01 rows=1 width=83) (actual time=786.617..2111.297 rows=768278 loops=1)
Hash Cond: (event.parent_path = event2.path)
-> Seq Scan on event (cost=0.00..252872.92 rows=2375157 width=83) (actual time=0.016..616.598 rows=2000000 loops=1)
Filter: ((id <= 13000000) AND (volume_id = 1))
-> Hash (cost=258811.41..258811.41 rows=54787 width=103) (actual time=786.214..786.214 rows=48029 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 6696kB
-> Seq Scan on event event2 (cost=0.00..258811.41 rows=54787 width=103) (actual time=0.068..766.563 rows=48029 loops=1)
Filter: ((id <= 13000000) AND (volume_id = 1) AND (type = '16384'::text))
Rows Removed by Filter: 1951971
-> Hash (cost=36960.95..36960.95 rows=761196 width=119) (actual time=582.430..582.430 rows=761196 loops=1)
Buckets: 1048576 Batches: 1 Memory Usage: 121605kB
-> Seq Scan on dir_current dir (cost=0.00..36960.95 rows=761196 width=119) (actual time=0.010..267.484 rows=761196 loops=1)
Filter: (volume_id = 1)
Planning time: 2.242 ms
Execution time: 3999.213 ms
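Both tables were analyzed before running the queries.
Any idea why Postgres picks a plan that is so far from optimal? Is there a better way to improve query performance than disabling index/bitmap scans? Perhaps a different query that returns the same result?
I am using Postgres 9.5.2. I would appreciate any help.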
Answer 0 (score: 2)
You are only selecting columns from one table. I would suggest rewriting the query as:
SELECT DISTINCT e.parent_path, e.depth
FROM sf.event e
WHERE e.volume_id = 1 AND e.id <= MAX_ID_VARIABLE AND
NOT EXISTS (SELECT 1
FROM sf.dir_current dc
WHERE e.parent_path = dc.path AND dc.volume_id = 1
) AND
NOT EXISTS (SELECT 1
FROM sf.event e2
WHERE e.parent_path = e2.path AND
e2.volume_id = 1 AND
e2.type = 'DIR' AND
e2.id <= MAX_ID_VARIABLE
)
ORDER BY e.depth, e.parent_path;
Then create the following indexes:
event(volume_id, id)
dir_current(path, volume_id)
event(path, volume_id, type, id)
I am not sure why there is a comparison with MAX_ID_VARIABLE. Without that comparison, the first index could include the sort keys: event(volume_id, depth, parent_path).
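As a rough sketch, the index suggestions above could be written out and checked as follows. The index names are made up for illustration, and 13000000 simply stands in for MAX_ID_VARIABLE because that is the value visible in the plans in the question. The query is the NOT EXISTS rewrite from this answer, with DISTINCT kept from the question so that it returns the same rows.

```sql
-- Hypothetical index names; the column lists are the ones suggested in this answer.
CREATE INDEX event_volume_id_id_idx
    ON sf.event (volume_id, id);
CREATE INDEX dir_current_path_volume_id_idx
    ON sf.dir_current (path, volume_id);
CREATE INDEX event_path_volume_type_id_idx
    ON sf.event (path, volume_id, type, id);

-- Refresh planner statistics so the new indexes are costed with current data.
ANALYZE sf.event;
ANALYZE sf.dir_current;

-- Compare the resulting plan with the two plans in the question.
-- 13000000 is an assumed stand-in for MAX_ID_VARIABLE.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT e.parent_path, e.depth
FROM sf.event e
WHERE e.volume_id = 1
  AND e.id <= 13000000
  AND NOT EXISTS (SELECT 1
                  FROM sf.dir_current dc
                  WHERE dc.path = e.parent_path
                    AND dc.volume_id = 1)
  AND NOT EXISTS (SELECT 1
                  FROM sf.event e2
                  WHERE e2.path = e.parent_path
                    AND e2.volume_id = 1
                    AND e2.type = 'DIR'
                    AND e2.id <= 13000000)
ORDER BY e.depth, e.parent_path;
```

Judging by its name and index condition, the existing unique index dircurrent_volumeid_path_unq in the question's plan already seems to cover volume_id and path, so the second index may add little. It is worth checking the plan before keeping it.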