Question:
SELECT new_filename
FROM tmp2_import_lightnings_filenames
WHERE new_filename
NOT IN (SELECT filename FROM service.import_lightnings_filenames LIMIT 64500)
LIMIT 1;
Execution time: 62 ms.
SELECT new_filename
FROM tmp2_import_lightnings_filenames
WHERE new_filename
NOT IN (SELECT filename FROM service.import_lightnings_filenames LIMIT 65000)
LIMIT 1;
Execution time: 4.742 s.
(All LIMITs are for testing purposes only.)
A huge slowdown! And it grows exponentially.
Tables:
CREATE TABLE public.tmp2_import_lightnings_filenames (
new_filename VARCHAR(63) NOT NULL,
CONSTRAINT tmp2_import_lightnings_filenames_pkey PRIMARY KEY(new_filename)
) WITHOUT OIDS;
Table size: 7,304 rows
Sample data: /xml/2012-07-13/01/01-24.xml
CREATE TABLE service.import_lightnings_filenames (
id SERIAL,
filename VARCHAR(63) NOT NULL,
imported BOOLEAN DEFAULT false,
strokes_num INTEGER,
CONSTRAINT import_lightnings_filenames_pkey PRIMARY KEY(id)
) WITHOUT OIDS;
CREATE UNIQUE INDEX import_lightnings_filenames_idx
ON service.import_lightnings_filenames
USING btree (filename COLLATE pg_catalog."default");
Table size: 70,812 rows
Sample data: 44;/xml/2012-05-26/12/12-18.xml;TRUE;NULL
Query plans:
Limit (cost=0.00..2108.11 rows=1 width=29) (actual time=240.183..240.183 rows=1 loops=1)
Buffers: shared hit=539, temp written=307
-> Seq Scan on tmp2_import_lightnings_filenames (cost=0.00..7698823.12 rows=3652 width=29) (actual time=240.181..240.181 rows=1 loops=1)
Filter: (NOT (SubPlan 1))
Buffers: shared hit=539, temp written=307
SubPlan 1
-> Materialize (cost=0.00..1946.82 rows=64500 width=29) (actual time=0.009..198.313 rows=64500 loops=1)
Buffers: shared hit=538, temp written=307
-> Limit (cost=0.00..1183.32 rows=64500 width=29) (actual time=0.005..113.196 rows=64500 loops=1)
Buffers: shared hit=538
-> Seq Scan on import_lightnings_filenames (cost=0.00..1299.12 rows=70812 width=29) (actual time=0.004..42.418 rows=64500 loops=1)
Buffers: shared hit=538
Total runtime: 240.982 ms
Limit (cost=0.00..2125.03 rows=1 width=29) (actual time=30734.619..30734.619 rows=1 loops=1)
Buffers: shared hit=547, temp read=112258 written=669
-> Seq Scan on tmp2_import_lightnings_filenames (cost=0.00..7760626.00 rows=3652 width=29) (actual time=30734.617..30734.617 rows=1 loops=1)
Filter: (NOT (SubPlan 1))
Buffers: shared hit=547, temp read=112258 written=669
SubPlan 1
-> Materialize (cost=0.00..1962.49 rows=65000 width=29) (actual time=0.798..42.306 rows=64820 loops=363)
Buffers: shared hit=543, temp read=112258 written=669
-> Limit (cost=0.00..1192.49 rows=65000 width=29) (actual time=0.005..116.110 rows=65000 loops=1)
Buffers: shared hit=543
-> Seq Scan on import_lightnings_filenames (cost=0.00..1299.12 rows=70812 width=29) (actual time=0.003..43.804 rows=65000 loops=1)
Buffers: shared hit=543
Total runtime: 30735.267 ms
What am I doing wrong?
Answer 0 (score: 3)
The reason for the performance degradation appears to be that you are running out of work_mem and the materialize step starts spilling to disk. Quoting the manual here:
work_mem (integer)
[...] Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.
Emphasis mine. Verify this by raising the setting for work_mem and running your query again. As @a_horse suggested in the comments, by calling:
set work_mem = '64MB';
You do not need your system administrator for that. You can reset to the default value within your session:
reset work_mem;
The setting is gone at the end of the session. Change the setting in postgresql.conf (and reload) for a permanent effect.
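A minimal sketch of what that could look like, assuming PostgreSQL 9.4 or later for ALTER SYSTEM (on older versions, edit the line in postgresql.conf by hand as described above):
-- Check the value currently in effect for this session:
SHOW work_mem;
-- Persist a new value server-wide (writes to postgresql.auto.conf, needs superuser):
ALTER SYSTEM SET work_mem = '64MB';
-- Reload the configuration without restarting the server:
SELECT pg_reload_conf();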
Many PostgreSQL packages ship with very conservative settings (1 MB is the default). This depends a lot on your workload, but generally 16 MB is the minimum on a machine with 4 GB of RAM or more. I use 64 MB on a dedicated database server with 12 GB of RAM and only a few concurrent users.
You may want to do some general tuning of your settings. Here is a list of pointers on general performance optimization in the PostgreSQL Wiki. You will also find more information about tuning work_mem by following the links.
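As a sketch of how to verify the diagnosis (64MB is just an assumed test value), raise work_mem for the session and re-run the slower query under EXPLAIN (ANALYZE, BUFFERS):
SET work_mem = '64MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT new_filename
FROM tmp2_import_lightnings_filenames
WHERE new_filename
NOT IN (SELECT filename FROM service.import_lightnings_filenames LIMIT 65000)
LIMIT 1;
-- With enough work_mem the plan should show a "hashed SubPlan" filter instead of
-- the Materialize node, and the "temp read/written" buffers should disappear.
RESET work_mem;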
Aside from that, rewriting your query may speed things up as well. IN subqueries with big lists tend to be among the slowest options in PostgreSQL:
SELECT new_filename
FROM tmp2_import_lightnings_filenames t
LEFT JOIN (
SELECT filename
FROM service.import_lightnings_filenames
LIMIT 65000
) x ON t.new_filename = x.filename
WHERE x.filename IS NULL;
Or with NOT EXISTS, in particular against service.import_lightnings_filenames:
SELECT new_filename
FROM tmp2_import_lightnings_filenames t
WHERE NOT EXISTS (
SELECT 1
FROM (
SELECT filename
FROM service.import_lightnings_filenames
LIMIT 65000
) x
WHERE t.new_filename = x.filename
);
The same with a CTE (probably not faster, but easier to read):
WITH x AS (
SELECT filename
FROM service.import_lightnings_filenames
LIMIT 65000
)
SELECT new_filename
FROM tmp2_import_lightnings_filenames t
WHERE NOT EXISTS (
SELECT 1
FROM x
WHERE t.new_filename = x.filename
);
Answer 1 (score: 0)
-- SET work_mem=20000;
SET random_page_cost=1.1;
SET effective_cache_size=10000000;
Setting work_mem to 1-20 MB will make the planner prefer hash tables (as long as they fit in memory), which is effective for small to medium-sized queries.
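A side note on the commented-out line above: a work_mem value given without a unit is interpreted as kilobytes, so 20000 means roughly 20 MB. A sketch with an explicit unit:
SET work_mem = '20MB';  -- roughly equivalent to SET work_mem = 20000 (read as kB)
SHOW work_mem;          -- confirm the value in effect for this session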
Setting random_page_cost lower will make the query planner prefer index scans where appropriate. That is the tipping point between the OP's first and second query (which, however, skip the index-scan stage in favour of a seqscan). The default value (= 4) is far too high.
effective_cache_size is an estimate of the amount of LRU buffer cache maintained by the OS. Set it as high as possible (without causing swapping).
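A small sketch to go with that (the value is an assumption, not a recommendation); note that effective_cache_size only informs the planner's cost estimates and allocates no memory itself:
-- Without a unit, effective_cache_size is read in 8 kB pages,
-- so the value 10000000 above corresponds to about 80 GB.
SET effective_cache_size = '8GB';
SHOW effective_cache_size;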