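I have a query that needs to update a table with about 14 million rows. It gets the value needed for the update from another table by joining through a link table, like this: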
UPDATE listings
SET master_ext_id = c.id
FROM listings a
JOIN listing_to_external_id b on a.id = b.listing_id
JOIN external_ids c on b.external_id = c.id
AND a.master_ext_id is null
AND c.provider_id = 0
Update on listings  (cost=731559.58..133068628689.23 rows=10645372213165 width=637)
  ->  Nested Loop  (cost=731559.58..133068628689.23 rows=10645372213165 width=637)
        ->  Seq Scan on listings  (cost=0.00..397447.29 rows=14832429 width=611)
        ->  Materialize  (cost=731559.58..1135721.70 rows=717709 width=26)
              ->  Hash Join  (cost=731559.58..1132133.16 rows=717709 width=26)
                    Hash Cond: (b.listing_id = a.id)
                    ->  Hash Join  (cost=148706.93..526852.10 rows=717709 width=28)
                          Hash Cond: (b.external_id = c.id)
                          ->  Seq Scan on listing_to_external_id b  (cost=0.00..236589.51 rows=15357551 width=22)
                          ->  Hash  (cost=139735.49..139735.49 rows=717715 width=14)
                                ->  Index Scan using ei_provider_id on external_ids c  (cost=0.00..139735.49 rows=717715 width=14)
                                      Index Cond: (provider_id = 0)
                    ->  Hash  (cost=397447.29..397447.29 rows=14832429 width=14)
                          ->  Seq Scan on listings a  (cost=0.00..397447.29 rows=14832429 width=14)
                                Filter: (master_ext_id IS NULL)
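Obviously, looking at the execution plan, you can see that this query will take a very long time. At this point I am assuming it has to do with the number of rows involved in the query, but I need a way to speed it up.
In addition to the roughly 14 million rows in the listings table, there are about 15 million rows in the listing_to_external_id table and about 15 million in the external_ids table.
I have tried setting enable_seqscan to off, and the query then uses the indexes I created, so I know this is just a case of the planner deciding that sequential scans would be faster. I have also run ANALYZE on my tables.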
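I tried using the primary key on the listings table to limit the rows updated, hoping that I could loop through and update one range of ids at a time. As you can see, it made little difference: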
UPDATE listings
SET master_ext_id = c.id
FROM listings a
JOIN listing_to_external_id b on a.id = b.listing_id
JOIN external_ids c on b.external_id = c.id
WHERE a.id >= 34649050
AND a.id <= 35649050
AND a.master_ext_id is null
AND c.provider_id = 0
Update on listings  (cost=212130.40..9379727588.60 rows=750294018398 width=637)
  ->  Nested Loop  (cost=212130.40..9379727588.60 rows=750294018398 width=637)
        ->  Seq Scan on listings  (cost=0.00..397447.29 rows=14832429 width=611)
        ->  Materialize  (cost=212130.40..600005.71 rows=50585 width=26)
              ->  Hash Join  (cost=212130.40..599752.78 rows=50585 width=26)
                    Hash Cond: (b.listing_id = a.id)
                    ->  Hash Join  (cost=148706.93..526852.10 rows=717709 width=28)
                          Hash Cond: (b.external_id = c.id)
                          ->  Seq Scan on listing_to_external_id b  (cost=0.00..236589.51 rows=15357551 width=22)
                          ->  Hash  (cost=139735.49..139735.49 rows=717715 width=14)
                                ->  Index Scan using ei_provider_id on external_ids c  (cost=0.00..139735.49 rows=717715 width=14)
                                      Index Cond: (provider_id = 0)
                    ->  Hash  (cost=50355.96..50355.96 rows=1045401 width=14)
                          ->  Index Scan using listings_pkey on listings a  (cost=0.00..50355.96 rows=1045401 width=14)
                                Index Cond: ((id >= 34649050) AND (id <= 35649050))
                                Filter: (master_ext_id IS NULL)
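I have tried tuning the Postgres settings to better handle such a large query, but that did not seem to make much difference either. I can go into those settings if nothing can be done with the query itself.
I also tried putting the result of the join between listing_to_external_id and external_ids into a table, indexing it, and then joining listings to that table. This produced a very similar execution plan and cost.
At this point I am not sure what else to do. I let the query run over the weekend and it is still running. Any suggestions?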
Answer 0 (score: 1)
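You have used the listings table twice: once in the UPDATE and once in the FROM. Look at the first execution plan. It contains a Cartesian product (CROSS JOIN) with listings, because the copy aliased as a in the FROM clause is never tied back to the table being updated. You only need listings in the UPDATE.
Try something like:
UPDATE listings a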
SET master_ext_id = c.id
FROM listing_to_external_id b
JOIN external_ids c on b.external_id = c.id
WHERE a.id = b.listing_id
AND a.master_ext_id is null
AND c.provider_id = 0
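If you still want to work through the table in batches, as in your second attempt, the same id range can be added to the corrected statement. The following is only a sketch: the bounds are the example values from the question, and prefixing the statement with EXPLAIN shows the plan without running the update, so you can confirm that the second scan of listings and the nested loop over it are gone before executing it for real.

```sql
-- Sketch only: id bounds are the example values from the question.
-- EXPLAIN prints the plan and does not execute the UPDATE.
EXPLAIN
UPDATE listings a
SET master_ext_id = c.id
FROM listing_to_external_id b
JOIN external_ids c ON b.external_id = c.id
WHERE a.id = b.listing_id          -- ties the FROM tables to the updated table
  AND a.id >= 34649050
  AND a.id <= 35649050
  AND a.master_ext_id IS NULL
  AND c.provider_id = 0;
```

Remove the EXPLAIN keyword to run the batch, then move the id range forward for the next one.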