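I have a join query on a varchar(24) primary key. The reason this is the key is legacy, and it is a target for change. However, the PostgreSQL query planner insists on doing a sequential scan, which seems unreasonable to me. I back up the "unreasonable" claim with the fact that "SET enable_seqscan = off" speeds this query up by a factor of 8.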
I am running "vacuum analyze", and I have played with the statistics settings, but so far no luck.
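For readers unfamiliar with those knobs, "playing with the statistics settings" presumably means something along these lines. This is only a sketch: the table and column names come from the setup script further down, and the target value 1000 is an arbitrary illustration, not a value taken from the question.

```sql
-- Raise the per-column statistics target (1000 is an arbitrary example value)
ALTER TABLE inventry ALTER COLUMN product_c SET STATISTICS 1000;

-- Re-collect statistics so the new target takes effect
VACUUM ANALYZE inventry;
VACUUM ANALYZE invenwh;

-- Inspect what the planner now knows about the filtered column
SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'inventry' AND attname = 'product_c';
```

Note that in the EXPLAIN output below the estimated row counts (4900 and 343000) already match the actual ones, so the row estimates themselves do not look like the source of the problem.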
The query:
select inventry.id, inventry.count, sum(invenwh.count)
from invenwh join inventry on inventry.id=invenwh.id
where inventry.product_c='CAT17'
group by 1, 2;
The following sets up the database to run this query.
drop table if exists inventry;
drop table if exists inwh;
drop table if exists invenwh;
drop table if exists inprodcategory;
-- Create 50 product categories.
create table inprodcategory as
select i as id, concat('CAT', lpad(i::text, 2, '0'))::varchar(10) as category
from generate_series(1, 50, 1) as i;
-- Create inventory items from 245,000 candidates
-- (i%50 = 0 matches no category id, so 240,100 rows are actually created)
create table inventry as
select
concat('ITEM', lpad(i::text, 6, '0'))::varchar(24) as id,
concat('Item #', i::text)::varchar(50) as descr_1,
c.category as product_c,
(case when random() < 0.05 then (random()*70)::int else 0::int end) as count
from generate_series(1, 245000, 1) as i
join inprodcategory as c on c.id=(i%50)::int;
-- Create 70 warehouses
create table inwh as
select concat('WAREHOUSE', lpad(i::text, 2, '0'))::varchar(10) as warehouse
from generate_series(1, 70, 1) as i;
-- Create (ugly) cross-join table with counts/warehouse
create table invenwh as
select id, warehouse,
(case when random() < 0.05 then (random()*10)::int else 0::int end) as count
from inventry, inwh;
create index on invenwh (id);
create index on inventry (id);
After running the above, you can run the query. On my hardware (SSD, i7, 16GB of RAM) it takes 4 seconds, but if I run "set enable_seqscan = off" first, it takes about 500ms.
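One way to reproduce that comparison without leaving the setting changed for the rest of the session is to scope it to a transaction with SET LOCAL. A minimal sketch, assuming the tables from the setup script above:

```sql
BEGIN;
-- SET LOCAL only lasts until the end of this transaction
SET LOCAL enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
select inventry.id, inventry.count, sum(invenwh.count)
from invenwh join inventry on inventry.id=invenwh.id
where inventry.product_c='CAT17'
group by 1, 2;

ROLLBACK;
```

Running the same EXPLAIN outside the transaction gives the default plan to compare against. Timings will vary with how much of the tables is already cached.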
Edit: added explain (analyze, buffers)
HashAggregate  (cost=449773.25..449822.25 rows=4900 width=19) (actual time=4180.006..4181.092 rows=4900 loops=1)
  Group Key: inventry.id, inventry.count
  Buffers: shared hit=4526 read=121051
  ->  Hash Join  (cost=5058.50..447200.75 rows=343000 width=19) (actual time=1285.800..4086.398 rows=343000 loops=1)
        Hash Cond: ((invenwh.id)::text = (inventry.id)::text)
        Buffers: shared hit=4526 read=121051
        ->  Seq Scan on invenwh  (cost=0.00..291651.00 rows=16807000 width=15) (actual time=0.077..1949.843 rows=16807000 loops=1)
              Buffers: shared hit=2530 read=121051
        ->  Hash  (cost=4997.25..4997.25 rows=4900 width=15) (actual time=48.897..48.897 rows=4900 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 230kB
              Buffers: shared hit=1996
              ->  Seq Scan on inventry  (cost=0.00..4997.25 rows=4900 width=15) (actual time=21.903..47.031 rows=4900 loops=1)
                    Filter: ((product_c)::text = 'CAT17'::text)
                    Rows Removed by Filter: 235200
                    Buffers: shared hit=1996
Planning time: 4.266 ms
Execution time: 4181.395 ms
Edit: specific follow-up questions
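Thanks to @a_horse_with_no_name (thank you very much!!), it looks like lowering random_page_cost does the trick. This seems more or less consistent with https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
Q: Is there a benchmark I can run to find the optimal value of random_page_cost? In production I am on SCSI disks (LSI MR9260-8i).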
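Q: I feel that statistics might also be relevant here, but I am coming up empty on pg-stats-for-dummies type pages on the internet. Any tips for learning about statistics?

Answer 0 (score: 0)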
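If the planner's cost estimates do not match the reality of execution times, you should adjust the cost settings to better match your hardware.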
The Planner Cost Constants section of the PostgreSQL documentation describes the various knobs.
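Before changing anything, it may help to see what the server is currently using. A small sketch that reads the cost-related settings from the pg_settings view (the list of names is just the commonly tuned subset, not an exhaustive one):

```sql
-- Current planner cost constants and the cache-size hint the planner uses
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('seq_page_cost',
               'random_page_cost',
               'cpu_tuple_cost',
               'cpu_index_tuple_cost',
               'cpu_operator_cost',
               'effective_cache_size')
ORDER BY name;
```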
In particular, the entry for random_page_cost comes with advice relevant to your case:
Storage that has a low random read cost relative to sequential, e.g. solid-state drives, might also be better modeled with a lower value for random_page_cost.
See also Random Page Cost Revisited for more tuning advice on this parameter across 5 different storage types.
TL;DR: for SSDs, try random_page_cost at 1.5 first.
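A cautious way to try this is to change the value for the current session only, re-check the plan, and persist it only once the result looks right. A sketch, assuming the tables from the question; the ALTER SYSTEM step is left commented out because it needs superuser rights and PostgreSQL 9.4 or later:

```sql
-- Session-level trial: affects only this connection
SET random_page_cost = 1.5;

EXPLAIN (ANALYZE, BUFFERS)
select inventry.id, inventry.count, sum(invenwh.count)
from invenwh join inventry on inventry.id=invenwh.id
where inventry.product_c='CAT17'
group by 1, 2;

-- Go back to the configured default for this session
RESET random_page_cost;

-- To persist the value server-wide once you are satisfied:
-- ALTER SYSTEM SET random_page_cost = 1.5;
-- SELECT pg_reload_conf();
```

Whether the planner actually switches to an index-based plan at 1.5 depends on the other cost settings and on the data, so treat 1.5 as a starting point rather than a guaranteed fix.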