Question

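I have a join query on a varchar(24) primary key. This column is the key for legacy reasons, and it is slated for change. However, the PostgreSQL query planner insists on doing a sequential scan, which seems unreasonable to me. I back up the claim of "unreasonable" with the fact that "SET enable_seqscan = off" speeds this query up by a factor of 8.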

I have run "vacuum analyze" and I have played with the statistics settings, but so far no luck.
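The exact commands are not shown in this post. As a rough sketch, steps of this kind usually look like the following, using the tables defined further down; the statistics target of 1000 is an assumed value for illustration, not the setting that was actually tried.

```sql
-- Illustrative only: the target value 1000 is an assumption.
-- Raise the per-column statistics target on the join column of both tables.
alter table invenwh alter column id set statistics 1000;
alter table inventry alter column id set statistics 1000;

-- Reclaim dead rows and refresh the planner statistics.
vacuum analyze invenwh;
vacuum analyze inventry;
```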

The query

select inventry.id, inventry.count, sum(invenwh.count) 
from invenwh join inventry on inventry.id=invenwh.id
where inventry.product_c='CAT17' 
group by 1, 2;

The following sets up a database to run this query.

drop table if exists inventry;
drop table if exists inwh;
drop table if exists invenwh;
drop table if exists inprodcategory;

-- Create 50 product categories.
create table inprodcategory as 
select i as id, concat('CAT', lpad(i::text, 2, '0'))::varchar(10) as category
from generate_series(1, 50, 1) as i;

-- Create 245,000 inventory items
create table inventry as 
select 
    concat('ITEM', lpad(i::text, 6, '0'))::varchar(24) as id, 
    concat('Item #', i::text)::varchar(50) as descr_1,
    c.category as product_c,
    (case when random() < 0.05 then (random()*70)::int else 0::int end) as count
from generate_series(1, 245000, 1) as i
    join inprodcategory as c on c.id=(i%50)::int;

-- Create 70 warehouses
create table inwh as 
select concat('WAREHOUSE', lpad(i::text, 2, '0'))::varchar(10) as warehouse
from generate_series(1, 70, 1) as i;

-- Create (ugly) cross-join table with counts/warehouse
create table invenwh as 
select id, warehouse, 
    (case when random() < 0.05 then (random()*10)::int else 0::int end) as count
from inventry, inwh;

create index on invenwh (id);
create index on inventry (id);

After running the above, you can run the query. On my hardware (SSD, i7, 16 GB of RAM) it takes about 4 seconds, but if I first run "set enable_seqscan = off" it takes about 500 ms.
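To reproduce that comparison in one session, something along these lines should work. This is a sketch for diagnosis only; enable_seqscan = off is generally not meant as a permanent setting, and timings will differ by machine.

```sql
-- Baseline: plan and timing with the default planner settings.
explain (analyze, buffers)
select inventry.id, inventry.count, sum(invenwh.count)
from invenwh join inventry on inventry.id = invenwh.id
where inventry.product_c = 'CAT17'
group by 1, 2;

-- Diagnostic: discourage sequential scans for this session only.
set enable_seqscan = off;

explain (analyze, buffers)
select inventry.id, inventry.count, sum(invenwh.count)
from invenwh join inventry on inventry.id = invenwh.id
where inventry.product_c = 'CAT17'
group by 1, 2;

-- Restore the default so later queries are planned normally.
reset enable_seqscan;
```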

Edit: added explain (analyze, buffers)

HashAggregate  (cost=449773.25..449822.25 rows=4900 width=19) (actual time=4180.006..4181.092 rows=4900 loops=1)
  Group Key: inventry.id, inventry.count
  Buffers: shared hit=4526 read=121051
  ->  Hash Join  (cost=5058.50..447200.75 rows=343000 width=19) (actual time=1285.800..4086.398 rows=343000 loops=1)
        Hash Cond: ((invenwh.id)::text = (inventry.id)::text)
        Buffers: shared hit=4526 read=121051
        ->  Seq Scan on invenwh  (cost=0.00..291651.00 rows=16807000 width=15) (actual time=0.077..1949.843 rows=16807000 loops=1)
              Buffers: shared hit=2530 read=121051
        ->  Hash  (cost=4997.25..4997.25 rows=4900 width=15) (actual time=48.897..48.897 rows=4900 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 230kB
              Buffers: shared hit=1996
              ->  Seq Scan on inventry  (cost=0.00..4997.25 rows=4900 width=15) (actual time=21.903..47.031 rows=4900 loops=1)
                    Filter: ((product_c)::text = 'CAT17'::text)
                    Rows Removed by Filter: 235200
                    Buffers: shared hit=1996
Planning time: 4.266 ms
Execution time: 4181.395 ms

Edit: specific follow-up questions

Thanks to @a_horse_with_no_name (much appreciated!!), it seems that lowering random_page_cost does the trick. This looks more or less consistent with https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Q: Is there a benchmark I can run to find the best value for random_page_cost? In production I am on SCSI disks (LSI MR9260-8i).

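Q: I feel that statistics may also be relevant here, but I am coming up empty when searching the internet for "pg-stats-for-dummies" type pages. Any tips on understanding statistics?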

Answer 1

If the planner's cost estimates do not match the actual execution times, you should adjust the cost settings to better match your hardware.

The various knobs are documented under Planner Cost Constants.
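Before changing anything, it can help to see what the server is currently using. A minimal sketch that reads the main cost-related settings from the pg_settings view (the list of names is a selection, not the complete set):

```sql
-- Show the current values of the main planner cost settings
-- and where each value comes from (default, config file, session, ...).
select name, setting, unit, source
from pg_settings
where name in (
    'seq_page_cost',
    'random_page_cost',
    'cpu_tuple_cost',
    'cpu_index_tuple_cost',
    'cpu_operator_cost',
    'effective_cache_size'
)
order by name;
```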

In particular, the entry for random_page_cost contains advice that is relevant to your case:

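Storage that has a low random read cost relative to sequential, e.g. solid-state drives, might also be better modeled with a lower value for random_page_cost.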

See also Random Page Cost Revisited for more tuning advice on this parameter for 5 different storage types.

TL;DR: for SSDs, try a random_page_cost of 1.5 first.
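A sketch of how that could be tried out safely: change the value for the current session first, check the plan, and only persist it if the result is better. The ALTER SYSTEM step assumes PostgreSQL 9.4 or later and superuser rights; on older versions the value would go into postgresql.conf instead.

```sql
-- Try the lower value for this session only.
set random_page_cost = 1.5;

explain (analyze, buffers)
select inventry.id, inventry.count, sum(invenwh.count)
from invenwh join inventry on inventry.id = invenwh.id
where inventry.product_c = 'CAT17'
group by 1, 2;

-- Go back to the configured value for the rest of the session.
reset random_page_cost;

-- If the plan improved, persist the setting server-wide
-- (assumes PostgreSQL 9.4+ and a superuser connection) and reload the config.
alter system set random_page_cost = 1.5;
select pg_reload_conf();
```

Testing at session level first keeps the experiment from affecting other connections.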

PostgreSQL performs a seq scan when an index is applicable. Why?
