Speeding up a Postgres update on a large table

Time: 2014-05-12 14:22:46

Tags: sql database postgresql

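I have a query that needs to update a table with roughly 14 million records. It gets the values for the update from another table by way of a join table. Like this: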

UPDATE listings
SET master_ext_id = c.id
FROM listings a
    JOIN listing_to_external_id b on a.id = b.listing_id
    JOIN external_ids c on b.external_id = c.id
AND a.master_ext_id is null
AND c.provider_id = 0

Update on listings  (cost=731559.58..133068628689.23 rows=10645372213165 width=637)
  ->  Nested Loop  (cost=731559.58..133068628689.23 rows=10645372213165 width=637)
        ->  Seq Scan on listings  (cost=0.00..397447.29 rows=14832429 width=611)
        ->  Materialize  (cost=731559.58..1135721.70 rows=717709 width=26)
              ->  Hash Join  (cost=731559.58..1132133.16 rows=717709 width=26)
                    Hash Cond: (b.listing_id = a.id)
                    ->  Hash Join  (cost=148706.93..526852.10 rows=717709 width=28)
                          Hash Cond: (b.external_id = c.id)
                          ->  Seq Scan on listing_to_external_id b  (cost=0.00..236589.51 rows=15357551 width=22)
                          ->  Hash  (cost=139735.49..139735.49 rows=717715 width=14)
                                ->  Index Scan using ei_provider_id on external_ids c  (cost=0.00..139735.49 rows=717715 width=14)
                                      Index Cond: (provider_id = 0)
                    ->  Hash  (cost=397447.29..397447.29 rows=14832429 width=14)
                          ->  Seq Scan on listings a  (cost=0.00..397447.29 rows=14832429 width=14)
                                Filter: (master_ext_id IS NULL)

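Looking at the execution plan, this query will obviously take a very long time. At this point I assume that is down to the number of rows involved in the query, but I need a way to speed it up.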

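Besides the roughly 14 million records in the listings table, there are about 15 million rows in the listing_to_external_id table and about 15 million in the external_ids table.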

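I have tried setting enable_seqscan to off, and the planner then uses the indexes I created, so I know this is simply a case of the planner deciding that sequential scans will be faster. I have also run ANALYZE on my tables.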

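I tried using the primary key on the listings table to limit the rows being updated, hoping I would be able to loop through and update one chunk at a time. As you can see, it made little difference: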

UPDATE listings
SET master_ext_id = c.id
FROM listings a
    JOIN listing_to_external_id b on a.id = b.listing_id
    JOIN external_ids c on b.external_id = c.id
WHERE a.id >= 34649050
AND a.id <= 35649050
AND a.master_ext_id is null
AND c.provider_id = 0

Update on listings  (cost=212130.40..9379727588.60 rows=750294018398 width=637)
  ->  Nested Loop  (cost=212130.40..9379727588.60 rows=750294018398 width=637)
        ->  Seq Scan on listings  (cost=0.00..397447.29 rows=14832429 width=611)
        ->  Materialize  (cost=212130.40..600005.71 rows=50585 width=26)
              ->  Hash Join  (cost=212130.40..599752.78 rows=50585 width=26)
                    Hash Cond: (b.listing_id = a.id)
                    ->  Hash Join  (cost=148706.93..526852.10 rows=717709 width=28)
                          Hash Cond: (b.external_id = c.id)
                          ->  Seq Scan on listing_to_external_id b  (cost=0.00..236589.51 rows=15357551 width=22)
                          ->  Hash  (cost=139735.49..139735.49 rows=717715 width=14)
                                ->  Index Scan using ei_provider_id on external_ids c  (cost=0.00..139735.49 rows=717715 width=14)
                                      Index Cond: (provider_id = 0)
                    ->  Hash  (cost=50355.96..50355.96 rows=1045401 width=14)
                          ->  Index Scan using listings_pkey on listings a  (cost=0.00..50355.96 rows=1045401 width=14)
                                Index Cond: ((id >= 34649050) AND (id <= 35649050))
                                Filter: (master_ext_id IS NULL)

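I have tried tuning the Postgres settings to better handle such a large query, but that did not seem to have much effect either. I can go into those settings if nothing can be done with the query itself.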

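I also tried putting the result of the join between listing_to_external_id and external_ids into a table, indexing it, and then joining listings to that table. This resulted in a very similar execution plan and cost.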

At this point I am not sure what else to do. I let the query run over the weekend, and it is still running. Any suggestions?

1 answer:

Answer 0 (score: 1)

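You use the listings table twice: once in the UPDATE and again in the FROM. Look at the first execution plan. The Nested Loop at the top has no join condition, so it is a Cartesian product (CROSS JOIN) between listings and the joined result. You only need listings in the UPDATE.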
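To see why the second reference matters, here is a minimal sketch on toy data. The tables t and src are hypothetical and exist only for illustration. In an UPDATE ... FROM, the target table and a same-named table in FROM are independent references, so unless a condition ties them together, every target row is paired with every row the FROM clause produces.

```sql
-- Hypothetical toy tables, used only to illustrate the pitfall.
CREATE TEMP TABLE t (id int PRIMARY KEY, val int);
INSERT INTO t VALUES (1, NULL), (2, NULL), (3, NULL);

CREATE TEMP TABLE src (t_id int, new_val int);
INSERT INTO src VALUES (1, 10), (2, 20), (3, 30);

-- Row pairing equivalent to "UPDATE t ... FROM t a JOIN src s ON s.t_id = a.id":
-- the outer t (the update target) is not connected to alias a at all.
SELECT count(*)
FROM t
CROSS JOIN (t a JOIN src s ON s.t_id = a.id);  -- 3 x 3 = 9 pairs

-- Row pairing when the source is joined directly to the target.
SELECT count(*)
FROM t
JOIN src s ON s.t_id = t.id;                   -- 3 pairs
```

With about 14 million rows on each side, the same effect produces the enormous row estimate at the top of the first plan.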

Try something like:
UPDATE listings a
SET master_ext_id = c.id
FROM listing_to_external_id b
JOIN external_ids c on b.external_id = c.id
WHERE a.id = b.listing_id
 AND a.master_ext_id is null
 AND c.provider_id = 0
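
Since the question mentions wanting to update in chunks, the same statement can probably be restricted to an id range and checked with EXPLAIN before it is run. The following is only a sketch: the range bounds are copied from the question, and the batch size and the resulting plan will depend on your data and settings.

```sql
-- Inspect the plan first; plain EXPLAIN does not execute the UPDATE.
EXPLAIN
UPDATE listings a
SET master_ext_id = c.id
FROM listing_to_external_id b
JOIN external_ids c ON b.external_id = c.id
WHERE a.id = b.listing_id
  AND a.master_ext_id IS NULL
  AND c.provider_id = 0
  AND a.id >= 34649050   -- range bounds reused from the question
  AND a.id <= 35649050;
```

If the plan no longer shows a second scan of listings under an unconditioned Nested Loop, drop the EXPLAIN and run the statement for each id range in turn. Note that if a listing has more than one matching external_ids row with provider_id = 0, Postgres uses only one of them for the update, and which one is not predictable.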