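I have a very strange dataset in which some records from a big table have no data at all, but when they do have data, it is on the order of 100,000 records each. I'm trying to select only the records that have data, but I'm having some problems with index usage. I know you usually can't "force" PostgreSQL to use a particular index, but in this case forcing it works.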
SELECT matches.id, count(frames.id) FROM matches LEFT JOIN frames ON frames.match_id = matches.id GROUP BY matches.id HAVING count(frames.id) > 0 ORDER BY count(frames.id) DESC;
id | count
----+--------
31 | 123363
28 | 121475
24 | 110155
21 | 108258
22 | 106837
25 | 89182
26 | 87104
27 | 86152
(8 rows)
SELECT matches.id, count(frames.id) FROM matches LEFT JOIN frames ON frames.match_id = matches.id GROUP BY matches.id HAVING count(frames.id) = 0 ORDER BY count(frames.id) DESC;
....
(568 rows)
The two solutions I found are:
SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);
Time: 11697,645 ms
or
SELECT DISTINCT "matches".* FROM "matches" INNER JOIN "frames" ON "frames"."match_id" = "matches"."id"
Time: 879,325 ms
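Neither query seems to use the index on match_id in the frames table. That is understandable, since the column is usually not very selective, but unfortunately here the index would be very useful, as: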
SET enable_seqscan = OFF;
SELECT "matches".* FROM "matches" WHERE (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);
Time: 1,239 ms
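Since a plain SET stays in effect for the rest of the session, the same experiment can be confined to a single transaction with SET LOCAL. This is a minimal sketch that reuses the query above unchanged; it is meant as a diagnostic aid, not as a recommended permanent fix.

```sql
BEGIN;

-- SET LOCAL only lasts until the end of this transaction, so other
-- queries in the session keep the default planner settings.
SET LOCAL enable_seqscan = off;

SELECT "matches".*
FROM "matches"
WHERE (SELECT true
       FROM frames
       WHERE frames.match_id = matches.id
       LIMIT 1);

COMMIT;
```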
EXPLAIN for queries:
EXPLAIN for: SELECT DISTINCT "matches".* FROM "matches" INNER JOIN "frames" ON "frames"."match_id" = "matches"."id"
QUERY PLAN
-----------------------------------------------------------------------------
HashAggregate  (cost=59253.47..59256.38 rows=290 width=155)
  ->  Hash Join  (cost=6.26..33716.73 rows=785746 width=155)
        Hash Cond: (frames.match_id = matches.id)
        ->  Seq Scan on frames  (cost=0.00..22906.46 rows=785746 width=4)
        ->  Hash  (cost=4.45..4.45 rows=145 width=155)
              ->  Seq Scan on matches  (cost=0.00..4.45 rows=145 width=155)
(6 rows)
EXPLAIN for: SELECT "matches".* FROM "matches" WHERE (EXISTS (SELECT id FROM frames WHERE frames.match_id = matches.id LIMIT 1))
QUERY PLAN
Seq Scan on matches  (cost=0.00..41.17 rows=72 width=155)
  Filter: (SubPlan 1)
  SubPlan 1
    ->  Limit  (cost=0.00..0.25 rows=1 width=4)
          ->  Seq Scan on frames  (cost=0.00..24870.83 rows=98218 width=4)
                Filter: (match_id = matches.id)
(6 rows)
SET enable_seqscan = OFF;
EXPLAIN SELECT "matches".* FROM "matches" WHERE (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);
QUERY PLAN
Seq Scan on matches  (cost=10000000000.00..10000000118.37 rows=72 width=155)
  Filter: (SubPlan 1)
  SubPlan 1
    ->  Limit  (cost=0.00..0.79 rows=1 width=0)
          ->  Index Scan using index_frames_on_match_id on frames  (cost=0.00..81762.68 rows=104066 width=0)
                Index Cond: (match_id = matches.id)
(6 rows)
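Any suggestions on how to tweak the query so that it uses the index? Or is there another way to check for the existence of records that would run closer to the 1 ms I get with the index than to 11 seconds?
PS. I did run ANALYZE and VACUUM ANALYZE, all the steps usually suggested to improve index usage.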
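To see what ANALYZE actually recorded for the column, the pg_stats view can be queried directly. The rows=98218 estimate in the SubPlan above is what 785746 rows spread evenly over 8 distinct match_id values would give (785746 / 8 ≈ 98218). That suggests, though the plans alone do not prove it, that the planner expects every match to have about that many frames and so expects the LIMIT 1 scan to stop almost immediately. A sketch, assuming the table lives in the public schema:

```sql
-- What the planner believes about frames.match_id after ANALYZE.
SELECT n_distinct,          -- estimated number of distinct match_id values
       most_common_vals,    -- values ANALYZE found most often
       most_common_freqs    -- their estimated share of the table
FROM pg_stats
WHERE schemaname = 'public'   -- assumption: default schema
  AND tablename  = 'frames'
  AND attname    = 'match_id';
```

If n_distinct comes back as 8, the statistics are accurate for frames. The mismatch then comes from the many rows in matches that have no frames at all, which up-to-date statistics on frames cannot tell the planner.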
EDIT: Thanks to David Aldridge for pointing out that LIMIT 1 might actually be hindering the query planner. I now have:
SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id);
Time: 163,803 ms
With the plan:
EXPLAIN SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id);
QUERY PLAN
------------------------------------------------------------------------------------
Nested Loop  (cost=25455.58..25457.90 rows=8 width=155)
  ->  HashAggregate  (cost=25455.58..25455.66 rows=8 width=4)
        ->  Seq Scan on frames  (cost=0.00..23374.26 rows=832526 width=4)
  ->  Index Scan using matches_pkey on matches  (cost=0.00..0.27 rows=1 width=155)
        Index Cond: (id = frames.match_id)
(5 rows)
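This is still about 100 times slower than the version that uses the index (probably because the Seq Scan on frames and the HashAggregate are still being executed).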
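If the Seq Scan on frames is indeed the bottleneck, one commonly used workaround is to emulate a "loose index scan" with a recursive CTE, which steps through the index on match_id one distinct value at a time instead of reading every frame. This does not come from the answers in this thread. It is a sketch that assumes PostgreSQL 8.4 or later (for WITH RECURSIVE) and the existing index index_frames_on_match_id. The CTE name populated_match_ids is made up for illustration, and whether each step really uses the index should be confirmed with EXPLAIN.

```sql
WITH RECURSIVE populated_match_ids AS (
    -- Start with the smallest match_id present in frames.
    (SELECT match_id FROM frames ORDER BY match_id LIMIT 1)
    UNION ALL
    -- Then repeatedly fetch the next larger match_id; the scalar
    -- subquery yields NULL once there is none, which ends the recursion.
    SELECT (SELECT f.match_id
            FROM frames f
            WHERE f.match_id > p.match_id
            ORDER BY f.match_id
            LIMIT 1)
    FROM populated_match_ids p
    WHERE p.match_id IS NOT NULL
)
SELECT m.*
FROM matches m
JOIN populated_match_ids p ON p.match_id = m.id;  -- the trailing NULL row never joins
```

With only a handful of distinct match_id values, this touches the index a few times rather than scanning hundreds of thousands of frames rows.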
Answer 0 (score: 2)
The LIMIT clause is redundant in the EXISTS-based alternative, and it might not be helping the optimizer.
Try:
SELECT "matches".*
FROM "matches"
WHERE EXISTS (SELECT 1
FROM frames
WHERE frames.match_id = matches.id);
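To check whether dropping LIMIT actually changes what the planner does, the same query can be run under EXPLAIN ANALYZE, which executes it and reports real row counts and timings next to the estimates. A minimal sketch:

```sql
-- EXPLAIN ANALYZE runs the query for real, so the reported timings
-- reflect the plan that was actually chosen.
EXPLAIN ANALYZE
SELECT "matches".*
FROM "matches"
WHERE EXISTS (SELECT 1
              FROM frames
              WHERE frames.match_id = matches.id);
```

Comparing the estimated and actual row counts on the frames node shows whether the planner's guess about how many frames each match has is close to reality.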