Question

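I have a very odd dataset: many records in a large table have no associated data at all, but those that do have around a hundred thousand rows each. I'm trying to select only the records that have data, and I'm running into problems with index usage. I know you normally can't "force" PostgreSQL to use a particular index, but in this case forcing it does work.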

SELECT matches.id, count(frames.id) FROM matches LEFT JOIN frames ON frames.match_id = matches.id GROUP BY matches.id HAVING count(frames.id) > 0 ORDER BY count(frames.id) DESC;
 id | count  
----+--------
 31 | 123363
 28 | 121475
 24 | 110155
 21 | 108258
 22 | 106837
 25 |  89182
 26 |  87104
 27 |  86152
(8 rows)

SELECT matches.id, count(frames.id) FROM matches LEFT JOIN frames ON frames.match_id = matches.id GROUP BY matches.id HAVING count(frames.id) = 0 ORDER BY count(frames.id) DESC;
....
(568 rows)

The two solutions I found are:

SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);
Time: 11697,645 ms


or

SELECT DISTINCT "matches".* FROM "matches" INNER JOIN "frames" ON "frames"."match_id" = "matches"."id"
Time: 879,325 ms

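Neither query seems to use the index on match_id in the frames table. That is understandable, since the index is usually not very selective, but unfortunately here it would be very useful, as this shows: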

SET enable_seqscan = OFF;
SELECT "matches".* FROM "matches" WHERE (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);
Time: 1,239 ms
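
For anyone reproducing this comparison, here is a minimal sketch of how the experiment could be confined to a single transaction, so the planner setting does not leak into the rest of the session. SET LOCAL and EXPLAIN ANALYZE are standard PostgreSQL. Wrapping the test this way is an assumption about workflow and is not how the timing above was taken.

```sql
-- Sketch only: keep the planner override inside one transaction.
BEGIN;

-- SET LOCAL lasts until COMMIT or ROLLBACK, unlike a plain SET,
-- which stays in effect for the whole session.
SET LOCAL enable_seqscan = off;

-- EXPLAIN ANALYZE really executes the query and reports actual row
-- counts and timings next to the planner's estimates.
EXPLAIN ANALYZE
SELECT "matches".*
FROM "matches"
WHERE (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);

-- Nothing was modified, so rolling back just discards the setting.
ROLLBACK;
```

Note that enable_seqscan = off does not forbid sequential scans; it only makes them look extremely expensive to the planner, so it is better suited to diagnosis than to production use.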

EXPLAIN for the queries:

EXPLAIN for: SELECT DISTINCT "matches".* FROM "matches" INNER JOIN "frames" ON "frames"."match_id" = "matches"."id"

                                 QUERY PLAN
-----------------------------------------------------------------------------
 HashAggregate  (cost=59253.47..59256.38 rows=290 width=155)
   ->  Hash Join  (cost=6.26..33716.73 rows=785746 width=155)
         Hash Cond: (frames.match_id = matches.id)
         ->  Seq Scan on frames  (cost=0.00..22906.46 rows=785746 width=4)
         ->  Hash  (cost=4.45..4.45 rows=145 width=155)
               ->  Seq Scan on matches  (cost=0.00..4.45 rows=145 width=155)
(6 rows)

EXPLAIN for: SELECT "matches".* FROM "matches" WHERE (EXISTS (SELECT id FROM frames WHERE frames.match_id = matches.id LIMIT 1))

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Seq Scan on matches  (cost=0.00..41.17 rows=72 width=155)
   Filter: (SubPlan 1)
   SubPlan 1
     ->  Limit  (cost=0.00..0.25 rows=1 width=4)
           ->  Seq Scan on frames  (cost=0.00..24870.83 rows=98218 width=4)
                 Filter: (match_id = matches.id)
(6 rows)

SET enable_seqscan = OFF;

EXPLAIN SELECT "matches".* FROM "matches" WHERE (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Seq Scan on matches  (cost=10000000000.00..10000000118.37 rows=72 width=155)
   Filter: (SubPlan 1)
   SubPlan 1
     ->  Limit  (cost=0.00..0.79 rows=1 width=0)
           ->  Index Scan using index_frames_on_match_id on frames  (cost=0.00..81762.68 rows=104066 width=0)
                 Index Cond: (match_id = matches.id)
(6 rows)

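Any suggestions on how to tweak the query so that it uses the index? Or is there another way to check for the existence of those records that runs closer to the 1 ms I get with the index than to the 11 seconds I get without it?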

PS. I did run ANALYZE and VACUUM ANALYZE, all the steps usually suggested to improve index usage.

EDIT: Thanks to David Aldridge for pointing out that LIMIT 1 might actually be hindering the query planner. I now have:

SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id);
Time: 163,803 ms

With the plan:

EXPLAIN SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id);
                                     QUERY PLAN                                     
------------------------------------------------------------------------------------
 Nested Loop  (cost=25455.58..25457.90 rows=8 width=155)
   ->  HashAggregate  (cost=25455.58..25455.66 rows=8 width=4)
         ->  Seq Scan on frames  (cost=0.00..23374.26 rows=832526 width=4)
   ->  Index Scan using matches_pkey on matches  (cost=0.00..0.27 rows=1 width=155)
         Index Cond: (id = frames.match_id)
(5 rows)

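This is still about 100x slower than the index-only version (probably because of the Seq Scan on frames plus the HashAggregate that are still being executed).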

Answer 1

In the EXISTS-based alternative, the LIMIT clause is redundant, and it may be hindering the optimizer rather than helping it.

Try:

SELECT "matches".*
FROM   "matches"
WHERE  EXISTS (SELECT 1
                 FROM frames
                WHERE frames.match_id = matches.id);
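
If the plan for that query still scans all of frames to collect the distinct match_id values, one further option is to emulate a "loose index scan" with a recursive CTE, which jumps from one distinct match_id to the next through the index instead of reading every row. The following is a sketch under assumptions, not something measured on this dataset: it assumes PostgreSQL 8.4 or later (for WITH RECURSIVE) and a plain b-tree index on frames (match_id), such as the index_frames_on_match_id shown in the question's plan. The CTE name distinct_ids is made up for illustration.

```sql
-- Sketch only: walk the distinct match_id values via the index.
WITH RECURSIVE distinct_ids AS (
    -- Start with the smallest match_id present in frames.
    (SELECT match_id FROM frames ORDER BY match_id LIMIT 1)
    UNION ALL
    -- Each step looks up the next larger match_id. The scalar subquery
    -- yields NULL once no larger value exists, which ends the recursion.
    SELECT (SELECT f.match_id
            FROM frames f
            WHERE f.match_id > d.match_id
            ORDER BY f.match_id
            LIMIT 1)
    FROM distinct_ids d
    WHERE d.match_id IS NOT NULL
)
SELECT "matches".*
FROM "matches"
JOIN distinct_ids ON distinct_ids.match_id = "matches".id;  -- the trailing NULL row never matches
```

With only a handful of distinct match_id values, as in the question, this needs roughly one index probe per distinct value. Whether it actually beats the hash-based plan should be checked with EXPLAIN ANALYZE on the real data.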
