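Why doesn't PostgreSQL use the PK index?

Asked: 2015-03-17 18:59:37

Tags: postgresql query-optimization

If I select 0.5% of the rows, or even 5% of the rows, from the following table by its PK, the query planner correctly chooses to use the PK index. Here is the table: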

create table weather as
with numbers as(
select generate_series as id from generate_series(0,1048575))
select id, 
50 + 50*sin(id) as temperature_in_f, 
50 + 50*sin(id) as humidity_in_percent
from numbers;

alter table weather
add constraint pk_weather primary key(id);

vacuum analyze weather;

The statistics are up to date, and the following query does use the PK index:

explain analyze select sum(w.id), sum(humidity_in_percent), count(*) 
from weather as w
where w.id between 1 and 66720;
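
One way to confirm that the statistics really are fresh is to look at the statistics collector's view of the table. This is only a sketch of such a check, assuming the table lives in a schema covered by pg_stat_user_tables:

```sql
-- Shows when the table was last vacuumed/analyzed and the collector's live row count.
select relname, n_live_tup, last_vacuum, last_analyze
from pg_stat_user_tables
where relname = 'weather';
```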

However, suppose we need to join this table with another, much smaller table:

create table lightnings 
as
select id as weather_id
from weather
where humidity_in_percent between 99.99 and 100;

alter table lightnings
add constraint pk_lightnings
primary key(weather_id);

analyze lightnings;

Here is my join, in four logically equivalent forms:

explain analyze select sum(w.id), count(*) from weather as w
where w.humidity_in_percent between 99.99 and 100
and exists(select * from lightnings as l
  where l.weather_id=w.id);

explain analyze select sum(w.id), count(*) 
from weather as w
join lightnings as l
  on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;

explain analyze select sum(w.id), count(*) 
from lightnings as l
join weather as w
  on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;

-- replaced explicit join with where clause
explain analyze select sum(w.id), count(*) 
from lightnings as l, weather as w
where w.humidity_in_percent between 99.99 and 100
and l.weather_id=w.id;

Unfortunately, the query planner resorts to scanning the whole weather table:

"Aggregate  (cost=22645.68..22645.69 rows=1 width=4) (actual time=167.427..167.427 rows=1 loops=1)"
"  ->  Hash Join  (cost=180.12..22645.52 rows=32 width=4) (actual time=2.500..166.444 rows=6672 loops=1)"
"        Hash Cond: (w.id = l.weather_id)"
"        ->  Seq Scan on weather w  (cost=0.00..22407.64 rows=5106 width=4) (actual time=0.013..158.593 rows=6672 loops=1)"
"              Filter: ((humidity_in_percent >= 99.99::double precision) AND (humidity_in_percent <= 100::double precision))"
"              Rows Removed by Filter: 1041904"
"        ->  Hash  (cost=96.72..96.72 rows=6672 width=4) (actual time=2.479..2.479 rows=6672 loops=1)"
"              Buckets: 1024  Batches: 1  Memory Usage: 235kB"
"              ->  Seq Scan on lightnings l  (cost=0.00..96.72 rows=6672 width=4) (actual time=0.009..0.908 rows=6672 loops=1)"
"Planning time: 0.326 ms"
"Execution time: 167.581 ms"

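The query planner's estimate of how many rows will be selected from the weather table is rows=5106. That is reasonably close to the exact value of 6672. If I select that few rows from the weather table by id, the PK index is used. If I select the same number of rows via a join with another table, the query planner resorts to scanning the table.

What am I missing?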

select version()
"PostgreSQL 9.4.0"

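Edit: if I remove the condition on humidity, the query planner correctly recognizes that the condition on weather.id is highly selective, and chooses to use the index on the PK: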

explain analyze select sum(w.id), count(*) from weather as w
where exists(select * from lightnings as l
  where l.weather_id=w.id);
"Aggregate  (cost=14677.84..14677.85 rows=1 width=4) (actual time=37.200..37.200 rows=1 loops=1)"
"  ->  Nested Loop  (cost=0.42..14644.48 rows=6672 width=4) (actual time=0.022..36.189 rows=6672 loops=1)"
"        ->  Seq Scan on lightnings l  (cost=0.00..96.72 rows=6672 width=4) (actual time=0.011..0.868 rows=6672 loops=1)"
"        ->  Index Only Scan using pk_weather on weather w  (cost=0.42..2.17 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=6672)"
"              Index Cond: (id = l.weather_id)"
"              Heap Fetches: 0"
"Planning time: 0.321 ms"
"Execution time: 37.254 ms"

However, adding the condition completely confuses the query planner.
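
A possible way to see how the planner costs the alternative is to discourage the hash and merge joins for a single transaction and compare the resulting plan with the one above. This is a diagnostic sketch only: the enable_* settings make the plan types look expensive rather than forbidding them, and the plan you get may still differ.

```sql
begin;
-- Transaction-local planner switches; for diagnosis, not for production use.
set local enable_hashjoin = off;
set local enable_mergejoin = off;

explain analyze select sum(w.id), count(*) from weather as w
where w.humidity_in_percent between 99.99 and 100
and exists(select * from lightnings as l
  where l.weather_id=w.id);

rollback;  -- discards the set local changes
```

Comparing the estimated cost and the actual time of the two plans shows whether the planner's preference for the hash join comes from its cost model or from the row estimates.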

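2 Answers:

Answer 0: (score: 1)

Expecting the optimizer to use the index on the PK of the larger table means you expect the query to be driven from the smaller table. You know, of course, that the rows the smaller table joins to in the larger table are the same rows the predicate selects, but the optimizer does not.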

Look at this line from the plan:

Hash Join  (cost=180.12..22645.52 rows=32 width=4) (actual time=2.500..166.444 rows=6672 loops=1)

It expects the join to produce 32 rows, but it actually produces 6672.
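
One plausible reading of where the 32 comes from, assuming the planner treats the humidity filter and the join condition as statistically independent: it takes the 5106 rows it expects from the filter on weather, multiplies by the 6672 rows in lightnings, and divides by the number of distinct id values in weather (1048576). This is back-of-the-envelope arithmetic, not a trace of the planner's code:

```sql
-- Assumption: join selectivity on a unique key is about 1 / (number of distinct ids).
-- 5106    = planner's row estimate for the humidity filter on weather
-- 6672    = rows in lightnings
-- 1048576 = rows (and distinct ids) in weather
select round(5106 * 6672 / 1048576.0) as estimated_join_rows;
```

The result is 32, which matches the estimate in the plan. The planner has no way to know that the filter and the join pick out exactly the same rows.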

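In any case, it has roughly three options:

  1. A full scan of the smaller table with index lookups into the larger table, applying the predicate to filter rows after the join (and expecting most rows to be filtered out).
  2. Full scans of both tables, with rows eliminated by the predicate on the larger table, and a hash join of the results.
  3. A scan of the larger table, with rows eliminated by the predicate, and index lookups into the smaller table that may fail to find a value.

The second of these was judged to have the lowest cost, and I think that, given the evidence it has, the optimizer is right to choose it, because hash joins are very efficient for joining many rows.

Of course, in this particular case an index on weather(humidity_in_percent, id) would probably be more efficient, but I suspect this is a modified version of your real situation (a sum of the id column?), so specific advice may not apply.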
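
The composite index mentioned above could be tried as follows. The index name is a hypothetical one chosen for this sketch, and whether the planner actually uses the index (or an index-only scan) depends on your data and settings:

```sql
-- ix_weather_humidity_id is a hypothetical name, not from the original post.
create index ix_weather_humidity_id on weather(humidity_in_percent, id);
analyze weather;

explain analyze select sum(w.id), count(*)
from weather as w
join lightnings as l
  on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;
```

Because the index contains both the filtered column and id, the query can in principle be answered from the index alone, without visiting the table.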

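Answer 1: (score: 0)

I believe the difference you are seeing between the first query, which uses the index, and the other 3, which do not, lies in the where clause.

In the first query, your where clause is on w.id, which is indexed.

In the other 3, the effective where clause is on w.humidity_in_percent. I tested the following...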

create index wh_idx on weather(humidity_in_percent);

explain analyse select sum(w.id), count(*) from weather as w
where w.humidity_in_percent between 99.99 and 100
and exists(select * from lightnings as l
  where l.weather_id=w.id);

and got a much better plan. I tried to post the actual plan that was returned, but I couldn't format it to display properly, sorry.