Take the following two tables:
Table "public.contacts"
Column | Type | Modifiers | Storage | Stats target | Description
--------------------+-----------------------------+-------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('contacts_id_seq'::regclass) | plain | |
created_at | timestamp without time zone | not null | plain | |
updated_at | timestamp without time zone | not null | plain | |
external_id | integer | | plain | |
email_address | character varying | | extended | |
first_name | character varying | | extended | |
last_name | character varying | | extended | |
company | character varying | | extended | |
industry | character varying | | extended | |
country | character varying | | extended | |
region | character varying | | extended | |
ext_instance_id | integer | | plain | |
title | character varying | | extended | |
Indexes:
"contacts_pkey" PRIMARY KEY, btree (id)
"index_contacts_on_ext_instance_id_and_external_id" UNIQUE, btree (ext_instance_id, external_id)
and
Table "public.members"
Column | Type | Modifiers | Storage | Stats target | Description
-----------------------+-----------------------------+--------------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('members_id_seq'::regclass) | plain | |
step_id | integer | | plain | |
contact_id | integer | | plain | |
rule_id | integer | | plain | |
request_id | integer | | plain | |
sync_id | integer | | plain | |
status | integer | not null default 0 | plain | |
matched_targeted_rule | boolean | default false | plain | |
external_fields | jsonb | | extended | |
imported_at | timestamp without time zone | | plain | |
campaign_id | integer | | plain | |
ext_instance_id | integer | | plain | |
created_at | timestamp without time zone | | plain | |
Indexes:
"members_pkey" PRIMARY KEY, btree (id)
"index_members_on_contact_id_and_step_id" UNIQUE, btree (contact_id, step_id)
"index_members_on_campaign_id" btree (campaign_id)
"index_members_on_step_id" btree (step_id)
"index_members_on_sync_id" btree (sync_id)
"index_members_on_request_id" btree (request_id)
"index_members_on_status" btree (status)
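Indexes exist on both primary keys and on members.contact_id.
I need to delete any contact that has no related members. There are roughly 3MM contact and 25MM member records.
I am attempting the following two queries: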
DELETE FROM "contacts"
WHERE "contacts"."id" IN (SELECT "contacts"."id"
FROM "contacts"
LEFT OUTER JOIN members
ON
members.contact_id = contacts.id
WHERE members.id IS NULL);
DELETE 0
Time: 173033.801 ms
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.354..188717.354 rows=0 loops=1)
-> Nested Loop (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.351..188717.351 rows=0 loops=1)
-> HashAggregate (cost=2654306.36..2654306.37 rows=1 width=16) (actual time=188717.349..188717.349 rows=0 loops=1)
Group Key: contacts_1.id
-> Hash Right Join (cost=161177.46..2654306.36 rows=1 width=16) (actual time=188717.345..188717.345 rows=0 loops=1)
Hash Cond: (members.contact_id = contacts_1.id)
Filter: (members.id IS NULL)
Rows Removed by Filter: 26725870
-> Seq Scan on members (cost=0.00..1818698.96 rows=25322396 width=14) (actual time=0.043..160226.686 rows=26725870 loops=1)
-> Hash (cost=105460.65..105460.65 rows=3205265 width=10) (actual time=1962.612..1962.612 rows=3196180 loops=1)
Buckets: 262144 Batches: 4 Memory Usage: 34361kB
-> Seq Scan on contacts contacts_1 (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.011..950.657 rows=3196180 loops=1)
-> Index Scan using contacts_pkey on contacts (cost=0.43..1.48 rows=1 width=10) (never executed)
Index Cond: (id = contacts_1.id)
Planning time: 0.488 ms
Execution time: 188718.862 ms
DELETE FROM contacts
WHERE NOT EXISTS (SELECT 1
FROM members c
WHERE c.contact_id = contacts.id);
DELETE 0
Time: 170871.219 ms
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.034..177523.034 rows=0 loops=1)
-> Hash Anti Join (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.029..177523.029 rows=0 loops=1)
Hash Cond: (contacts.id = c.contact_id)
-> Seq Scan on contacts (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.018..1068.357 rows=3196180 loops=1)
-> Hash (cost=1818698.96..1818698.96 rows=25322396 width=10) (actual time=169587.802..169587.802 rows=26725870 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 36228kB
-> Seq Scan on members c (cost=0.00..1818698.96 rows=25322396 width=10) (actual time=0.052..160081.880 rows=26725870 loops=1)
Planning time: 0.901 ms
Execution time: 177524.526 ms
As you can see, without even deleting any records, both queries show similar performance, taking about 3 minutes.
The server disk I/O spikes to 100%, so I assume that data is being spilled out to disk because a sequential scan is done on both contacts and members.
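The server is an EC2 r3.large (15GB RAM).
Any ideas on what I can do to optimize this query?
After running vacuum analyze for both tables and ensuring enable_mergejoin is set to on, there is no difference in the query time: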
DELETE FROM contacts
WHERE NOT EXISTS (SELECT 1
FROM members c
WHERE c.contact_id = contacts.id);
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.342..209406.342 rows=0 loops=1)
-> Hash Anti Join (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)
Hash Cond: (contacts.id = c.contact_id)
-> Seq Scan on contacts (cost=0.00..105683.28 rows=3227528 width=10) (actual time=0.008..1010.643 rows=3227462 loops=1)
-> Hash (cost=1814029.74..1814029.74 rows=24855474 width=10) (actual time=198054.302..198054.302 rows=27307060 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 37006kB
-> Seq Scan on members c (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)
Planning time: 0.328 ms
Execution time: 209408.040 ms
PG version:
PostgreSQL 9.4.4 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (Gentoo Hardened 4.5.4 p1.0, pie-0.4.7) 4.5.4, 64-bit
Relation size:
Table | Size | External Size
-----------------------+---------+---------------
members | 23 GB | 11 GB
contacts | 944 MB | 371 MB
Settings:
work_mem
----------
64MB
random_page_cost
------------------
4
Attempting to do this in batches doesn't seem to help with the I/O usage (it still spikes to 100%) and doesn't seem to improve the run time either, despite using index-based plans.
DO $do$
BEGIN
FOR i IN 57..668
LOOP
DELETE
FROM contacts
WHERE contacts.id IN
(
SELECT contacts.id
FROM contacts
left outer join members
ON members.contact_id = contacts.id
WHERE members.id IS NULL
AND contacts.id >= (i * 10000)
AND contacts.id < ((i+1) * 10000));
END LOOP;
END $do$;
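I had to kill the query after Time: 1203492.326 ms, and disk I/O stayed at 100% for the entire time the query ran. I also tried chunks of 1,000 and 5,000 but did not see any increase in performance.
Note: the 57..668 range was used because I know those are existing contact IDs (e.g. min(id) and max(id)).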
Answer 0 (score: 3)
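One way to approach this kind of problem is to do it in smaller chunks.
DELETE FROM "contacts"
WHERE "contacts"."id" IN (
SELECT contacts.id
FROM contacts
LEFT OUTER JOIN members ON members.contact_id = contacts.id
WHERE members.id IS NULL
-- qualify id: both contacts and members have an id column
AND contacts.id >= 1 AND contacts.id < 1000
);
DELETE FROM "contacts"
WHERE "contacts"."id" IN (
SELECT contacts.id
FROM contacts
LEFT OUTER JOIN members ON members.contact_id = contacts.id
WHERE members.id IS NULL
-- start exactly where the previous chunk ended so no id is skipped
AND contacts.id >= 1000 AND contacts.id < 2000
);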
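Rinse, repeat. Experiment with different chunk sizes to find the best one for your data set: the one that needs the fewest queries while still keeping each chunk in memory.
Naturally, you will want to script this, possibly in plpgsql, or in whatever scripting language you prefer.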
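Scripting the chunks could look roughly like the following Python sketch. It only builds the statement text for consecutive half-open id ranges, so no id is skipped or visited twice. The helper name chunked_delete_statements and the chunk size are illustrative assumptions, not something from the question, and actually running the statements against the server is left out.

```python
from typing import Iterator

# Same anti-join DELETE as above, restricted to one half-open id range [lo, hi).
DELETE_TEMPLATE = """DELETE FROM contacts
WHERE contacts.id IN (
    SELECT contacts.id
    FROM contacts
    LEFT OUTER JOIN members ON members.contact_id = contacts.id
    WHERE members.id IS NULL
      AND contacts.id >= {lo} AND contacts.id < {hi}
);"""


def chunked_delete_statements(min_id: int, max_id: int, chunk_size: int) -> Iterator[str]:
    """Yield one DELETE per chunk so that every id in [min_id, max_id] is covered once."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    lo = min_id
    while lo <= max_id:
        hi = lo + chunk_size  # exclusive upper bound, reused as the next lower bound
        yield DELETE_TEMPLATE.format(lo=lo, hi=hi)
        lo = hi


if __name__ == "__main__":
    # Hypothetical id range and chunk size, for illustration only.
    for statement in chunked_delete_statements(1, 2500, 1000):
        print(statement)
```

Reusing each chunk's exclusive upper bound as the next chunk's lower bound avoids off-by-one gaps between ranges.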
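Answer 1 (score: 2)
"Any ideas on what I can do to optimize this query?"
Your queries are perfect. I would use the NOT EXISTS variant.
Your index index_members_on_contact_id_and_step_id is also good for it, but see below about BRIN indexes.
You can tune your server, table, and index configuration.
Since you don't actually update or delete many rows (hardly any, according to your comment?), you need to optimize read performance.
You provided:
The server is an EC2 r3.large (15GB RAM).
and
PostgreSQL 9.4.4
Your version is seriously outdated. At the very least, upgrade to the latest minor version. Better still, upgrade to the current major version. Postgres 9.5 and 9.6 brought major improvements for big data, which is exactly what you need.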
Consider the versioning policy of the project.
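There is an unexpected 10% mismatch between the expected and the actual row count in the basic sequential scan:
Seq Scan on members c (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)
Not dramatic at all, but it still should not occur in this query. It indicates that you may need to tune your autovacuum settings, possibly per table for the very big ones.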
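To make the size of that mismatch concrete, here is a small Python sketch that recomputes it from the two row counts in the plan line quoted above. The variable and function names are illustrative only.

```python
# Row counts copied from the "Seq Scan on members c" plan line.
estimated_rows = 24_855_474  # planner estimate (rows=...)
actual_rows = 27_307_060     # rows actually read (actual ... rows=...)


def relative_error(estimated: int, actual: int) -> float:
    """Fraction by which the actual count exceeds the planner's estimate."""
    return (actual - estimated) / estimated


mismatch = relative_error(estimated_rows, actual_rows)
print(f"planner underestimated members by {mismatch:.1%}")
```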
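More problematic:
Hash Anti Join (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)
Postgres expects to find 1875003 rows to delete, while it actually finds 0 rows. That is unexpected. Substantially increasing the statistics target on members.contact_id and contacts.id may help narrow the gap, which might allow better query plans.
The ~25MM rows in members occupy 23 GB, which is almost 1 kB per row. That seems excessive for the table definition you presented (even if the total size you provided should include indexes):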
24 bytes tuple header
4 item pointer
8 null bitmap
36 9x integer
16 2x ts
1 1x bool
? 1x jsonb
That is 89 bytes per row, or less with some NULL values, with hardly any alignment padding, so 96 bytes at most, plus your jsonb column.
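The arithmetic behind those figures can be replayed with a short Python sketch. The per-component byte counts are the ones listed above, and the table size and row count are the rounded values from the question. The names COMPONENTS and align_up are illustrative assumptions.

```python
# Fixed-size parts of one members row, in bytes, as itemized above.
COMPONENTS = {
    "tuple header": 24,
    "item pointer": 4,
    "null bitmap": 8,
    "9 x integer": 9 * 4,
    "2 x timestamp": 2 * 8,
    "1 x boolean": 1,
}


def align_up(size: int, alignment: int = 8) -> int:
    """Round size up to the next multiple of alignment."""
    return -(-size // alignment) * alignment


fixed_bytes = sum(COMPONENTS.values())
padded_bytes = align_up(fixed_bytes)

table_bytes = 23 * 1024**3  # reported size of members (23 GB)
row_count = 25_000_000      # approximate row count (~25MM)
observed_bytes_per_row = table_bytes / row_count

print(fixed_bytes, padded_bytes, round(observed_bytes_per_row))
```

The gap between the padded fixed part and the observed footprint per row is what has to be explained by the jsonb column, by indexes, or by bloat.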
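Either that jsonb column is very big, in which case I would suggest normalizing the data into separate columns or a separate table, or your table is bloated, which can be solved with VACUUM FULL ANALYZE or, while you are at it: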
CLUSTER members USING index_members_on_contact_id_and_step_id;
VACUUM members;
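But either takes an exclusive lock on the table, which you said you cannot afford. pg_repack can do it without an exclusive lock.
Even if we factor in index sizes, your table seems too big: you have 7 small indexes, each with 36-44 bytes per row without bloat, less with NULL values, so < 300 bytes altogether.
Either way, consider more aggressive autovacuum settings for the table members.
And/or stop bloating the table to begin with. Are you updating rows a lot? Is there a particular column you update a lot? That jsonb column, maybe? You could move it to a separate (1:1) table to stop bloating the main table with dead tuples, which keep autovacuum from doing its job.
Block range indexes (BRIN) require Postgres 9.5 or later and dramatically reduce index size. I was too optimistic in my first draft. A BRIN index is perfect for your use case if you have many rows in members per contact_id, after physically clustering your table at least once (see ③ above for the fitting CLUSTER command). In that case Postgres can rule out whole data pages quickly. But your numbers indicate only around 8 rows per contact_id, so data pages would often contain multiple values, which voids much of the effect. It depends on the actual details of your data distribution...
On the other hand, as it stands, your tuple size is around 1 kB, so there are only ~8 rows per data page (typically 8 kB). If that isn't mostly bloat, a BRIN index might help after all.
But you need to upgrade your server version first. See ① above.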
CREATE INDEX members_contact_id_brin_idx ON members USING BRIN (contact_id);
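The density argument behind the BRIN discussion above can be checked with a few lines of Python. The row counts are the actual counts from the latest plan in the question, and the ~1 kB tuple size and 8 kB page size are the rough figures used in this answer. Treat the result as a back-of-the-envelope estimate only.

```python
# Actual row counts from the most recent EXPLAIN ANALYZE in the question.
members_rows = 27_307_060
contacts_rows = 3_227_462

page_bytes = 8192   # default PostgreSQL block size
tuple_bytes = 1000  # rough per-row footprint (~1 kB), an approximation

rows_per_contact = members_rows / contacts_rows  # average members per contact
rows_per_page = page_bytes // tuple_bytes        # whole tuples that fit in one page

print(f"{rows_per_contact:.1f} rows per contact, {rows_per_page} rows per page")
```

When both numbers are of similar size, a clustered table holds roughly one contact_id per page, which is the situation where a BRIN index could still pay off.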
Answer 2 (score: 1)
Update the statistics used by the planner and set enable_mergejoin to on:
vacuum analyse members;
vacuum analyse contacts;
set enable_mergejoin to on;
You should get a query plan similar to this one:
explain analyse
delete from contacts
where not exists (
select 1
from members c
where c.contact_id = contacts.id);
QUERY PLAN
----------------------------------------------------------------------
Delete on contacts
-> Merge Anti Join
Merge Cond: (contacts.id = c.contact_id)
-> Index Scan using contacts_pkey on contacts
-> Index Scan using members_contact_id_idx on members c
Answer 3 (score: 0)
Here is another variant to try:
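DELETE FROM contacts
USING contacts c
LEFT JOIN members m
ON c.id = m.contact_id
-- tie the aliased copy back to the DELETE target; otherwise the USING join is an unconstrained cross join
WHERE contacts.id = c.id
AND m.contact_id IS NULL;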
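It uses a technique for deleting from a join query.
I can't guarantee that this will definitely be faster, but it might be, because it avoids the subquery. I would be interested in the results...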
Answer 4 (score: 0)
Using a subquery in the WHERE clause takes a lot of time.
You should use WITH and USING instead; that should be much, much faster:
WITH c_not_member AS (
    -- extract the ids of contacts that have no members
    SELECT c.id
    FROM contacts c
    LEFT JOIN members m ON c.id = m.contact_id
    WHERE
        -- to get the contacts that don't exist in members, test a
        -- members column that can never be NULL; here that is id
        m.id IS NULL
        -- m.id is NULL only when no m.contact_id matches c.id,
        -- in other words when c.id does not exist in m.contact_id
)
DELETE FROM contacts all_c
USING c_not_member
WHERE all_c.id = c_not_member.id;
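The reasoning in the comments above (m.id is NULL exactly for contacts without members) can be checked on a toy data set. The Python sketch below uses an in-memory SQLite database purely as a stand-in for PostgreSQL, with made-up rows. It compares the LEFT JOIN ... IS NULL filter with the NOT EXISTS form and then runs the delete. It says nothing about performance on the real tables.

```python
import sqlite3

# In-memory SQLite as a stand-in for PostgreSQL; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (id INTEGER PRIMARY KEY);
    CREATE TABLE members (id INTEGER PRIMARY KEY, contact_id INTEGER);
    INSERT INTO contacts (id) VALUES (1), (2), (3), (4);
    INSERT INTO members (id, contact_id) VALUES (10, 1), (11, 1), (12, 3);
""")

# Anti-join written as LEFT JOIN plus a NULL test on a non-nullable members column.
left_join_ids = [row[0] for row in conn.execute("""
    SELECT c.id
    FROM contacts c
    LEFT JOIN members m ON m.contact_id = c.id
    WHERE m.id IS NULL
    ORDER BY c.id
""")]

# The same anti-join written with NOT EXISTS.
not_exists_ids = [row[0] for row in conn.execute("""
    SELECT c.id
    FROM contacts c
    WHERE NOT EXISTS (SELECT 1 FROM members m WHERE m.contact_id = c.id)
    ORDER BY c.id
""")]

# Delete the contacts without members and list what is left.
conn.execute("""
    DELETE FROM contacts
    WHERE NOT EXISTS (SELECT 1 FROM members m WHERE m.contact_id = contacts.id)
""")
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM contacts ORDER BY id")]

print(left_join_ids, not_exists_ids, remaining_ids)
```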