Question

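I am updating a column in one table using data from another table. The WHERE clause is based on multiple columns, and some of those columns are null. As far as I can tell, these nulls are what is throwing off the standard UPDATE TABLE SET X=Y WHERE A=B statement.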

See this SQL Fiddle for the two tables, where I am trying to update table_one based on data from table_two. My query currently looks like this:

UPDATE table_one SET x = table_two.y 
FROM table_two
WHERE 
table_one.invoice_number = table_two.invoice_number AND
table_one.submitted_by = table_two.submitted_by AND
table_one.passport_number = table_two.passport_number AND
table_one.driving_license_number = table_two.driving_license_number AND
table_one.national_id_number = table_two.national_id_number AND
table_one.tax_pin_identification_number = table_two.tax_pin_identification_number AND
table_one.vat_number = table_two.vat_number AND
table_one.ggcg_number = table_two.ggcg_number AND
table_one.national_association_number = table_two.national_association_number

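The query fails for some rows: table_one.x does not get updated when any of the columns in either table is null. That is, it only gets updated when all the columns have some data.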

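This question is related to my earlier one here on SO, where I was getting distinct values from a large data set using Distinct On. What I want now is to populate the large data set with a value from the table that has the unique fields.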

Update

I used the first update statement provided by @binoternary. For small tables, it runs in a flash. For example, one table with 20,000 records finished updating in about 20 seconds. But another table with over 9 million records has been running for 20 hours so far! See the output of EXPLAIN below:

Update on table_one  (cost=0.00..210634237338.87 rows=13615011125 width=1996)
  ->  Nested Loop  (cost=0.00..210634237338.87 rows=13615011125 width=1996)
        Join Filter: ((((my_update_statement_here))))
        ->  Seq Scan on table_one  (cost=0.00..610872.62 rows=9661262 width=1986)
        ->  Seq Scan on table_two  (cost=0.00..6051.98 rows=299998 width=148)

The EXPLAIN ANALYZE option also took forever, so I canceled it.

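Any ideas on how to make this kind of update faster? Even if it means using a different update statement, or even a custom function that loops through the rows and performs the update.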

Answer 1

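Since null = null evaluates to null rather than true (so the WHERE clause rejects the row), you need to check whether both fields are null in addition to the equality check: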

UPDATE table_one SET x = table_two.y 
FROM table_two
WHERE 
    (table_one.invoice_number = table_two.invoice_number 
        OR (table_one.invoice_number is null AND table_two.invoice_number is null))
    AND
    (table_one.submitted_by = table_two.submitted_by 
        OR (table_one.submitted_by is null AND table_two.submitted_by is null))
    AND 
    -- etc

You could also use the coalesce function, which is more readable:

UPDATE table_one SET x = table_two.y 
FROM table_two
WHERE 
    coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')
    AND coalesce(table_one.submitted_by, '') = coalesce(table_two.submitted_by, '')
    AND -- etc

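But you need to be careful with the default value (the last argument to coalesce). Its data type should match the column type (so that you don't end up comparing dates with numbers, for example), and it should be a value that never appears in the data. E.g. coalesce(null, 1) = coalesce(1, 1) is true even though one side was null, which is the situation you want to avoid.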
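If picking a safe default value is awkward, PostgreSQL's IS NOT DISTINCT FROM comparison is another way to write a null-safe match. It is not part of the original answer; the sketch below is a minimal illustration that assumes the table and column names from the question and shows only two of the nine join columns.

```sql
-- Plain equality yields NULL when either side is NULL;
-- IS NOT DISTINCT FROM treats two NULLs as equal and yields true.
SELECT NULL = NULL                    AS plain_equals,
       NULL IS NOT DISTINCT FROM NULL AS null_safe_equals;

-- Same matching rule as the "= OR (both IS NULL)" form, written once per column.
UPDATE table_one
SET x = table_two.y
FROM table_two
WHERE table_one.invoice_number IS NOT DISTINCT FROM table_two.invoice_number
  AND table_one.submitted_by   IS NOT DISTINCT FROM table_two.submitted_by
  -- AND ... repeat for the remaining join columns
;
```

This only addresses correctness. As far as I know, an ordinary index is not used for this comparison, so it should not be expected to make the update any faster.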

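Update (regarding performance):

Seq Scan on table_two - this suggests that you don't have any indexes on table_two. So when a row in table_one is updated, the database has to scan through all the rows of table_two one by one until it finds the matching row.
The matching rows could be found much faster if the relevant columns were indexed.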

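On the flip side, if table_one has any indexes, they slow the update down. According to one performance guide:

Table constraints and indexes heavily delay every write. If possible, you should drop all the indexes, triggers and foreign keys while the update runs and recreate them at the end.

Another suggestion from the same guide that might be helpful is:

If you can segment your data using, for example, sequential IDs, you can update rows incrementally in batches.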

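So, for example, if table_one has an id column, you could add something like

and table_one.id between x and y

to the where condition and run the query several times, changing the values of x and y so that all the rows are covered.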
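The batching idea can be sketched as a loop instead of rerunning the statement by hand. This is only an illustration under several assumptions: table_one.id is an integer column, the batch size of 100000 is an arbitrary placeholder, the server is PostgreSQL 11 or later (needed for COMMIT inside a DO block, which must not itself be run inside an outer transaction), and only the first join column is written out.

```sql
DO $$
DECLARE
    batch_size CONSTANT bigint := 100000;  -- hypothetical batch size, tune as needed
    lo     bigint;                         -- lower bound of the current id range
    max_id bigint;                         -- highest id present in table_one
BEGIN
    SELECT min(id), max(id) INTO lo, max_id FROM table_one;

    -- If table_one is empty, lo is NULL and the loop body never runs.
    WHILE lo <= max_id LOOP
        UPDATE table_one
        SET x = table_two.y
        FROM table_two
        WHERE table_one.id >= lo
          AND table_one.id <  lo + batch_size
          AND (table_one.invoice_number = table_two.invoice_number
               OR (table_one.invoice_number IS NULL AND table_two.invoice_number IS NULL))
          -- AND ... repeat for the remaining join columns
        ;

        COMMIT;                 -- make each batch durable before starting the next
        lo := lo + batch_size;
    END LOOP;
END
$$;
```

Committing per batch keeps each transaction short, and an interrupted run loses only the batch that was in progress.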

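The EXPLAIN ANALYZE option also took forever

You might want to be careful when using the ANALYZE option with EXPLAIN on statements that have side effects. According to the PostgreSQL documentation for EXPLAIN:

Keep in mind that the statement is actually executed when the ANALYZE option is used. Although EXPLAIN will discard any output that a SELECT would return, other side effects of the statement will happen as usual.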
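One common way to get EXPLAIN ANALYZE timings for a data-modifying statement without keeping its changes is to wrap it in a transaction and roll back. A minimal sketch, assuming the table and column names from the question and showing only the first join condition:

```sql
BEGIN;

-- The UPDATE really runs here, so this takes as long as the update itself.
EXPLAIN ANALYZE
UPDATE table_one
SET x = table_two.y
FROM table_two
WHERE table_one.invoice_number = table_two.invoice_number
  -- AND ... the remaining join conditions
;

-- Discard the changes made by the analyzed UPDATE.
ROLLBACK;
```

Plain EXPLAIN (without ANALYZE) only plans the statement and returns immediately, which is usually enough to see whether the plan still contains the Nested Loop over two sequential scans.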

Answer 2

Try the below, which is similar to @binoternary's answer above. They just beat me to it.

update table_one
set column_x = (select column_y from table_two 
where 
(( table_two.invoice_number = table_one.invoice_number)OR (table_two.invoice_number IS NULL AND table_one.invoice_number IS NULL))
and ((table_two.submitted_by=table_one.submitted_by)OR (table_two.submitted_by IS NULL AND table_one.submitted_by IS NULL)) 
and ((table_two.passport_number=table_one.passport_number)OR (table_two.passport_number IS NULL AND table_one.passport_number IS NULL)) 
and ((table_two.driving_license_number=table_one.driving_license_number)OR (table_two.driving_license_number IS NULL AND table_one.driving_license_number IS NULL)) 
and ((table_two.national_id_number=table_one.national_id_number)OR (table_two.national_id_number IS NULL AND table_one.national_id_number IS NULL)) 
and ((table_two.tax_pin_identification_number=table_one.tax_pin_identification_number)OR (table_two.tax_pin_identification_number IS NULL AND table_one.tax_pin_identification_number IS NULL)) 
and ((table_two.vat_number=table_one.vat_number)OR (table_two.vat_number IS NULL AND table_one.vat_number IS NULL)) 
and ((table_two.ggcg_number=table_one.ggcg_number)OR (table_two.ggcg_number IS NULL AND table_one.ggcg_number IS NULL)) 
and ((table_two.national_association_number=table_one.national_association_number)OR (table_two.national_association_number IS NULL AND table_one.national_association_number IS NULL)) 
);

Answer 3

You can use a null-check function like Oracle's NVL. For Postgres, you will have to use coalesce.

i.e. your query could look like:

UPDATE table_one SET x =(select  table_two.y from table_one,table_two
WHERE 
coalesce(table_one.invoice_number,table_two.invoice_number,1) = coalesce(table_two.invoice_number,table_one.invoice_number,1) 
AND
coalesce(table_one.submitted_by,table_two.submitted_by,1) = coalesce(table_two.submitted_by,table_one.submitted_by,1))

where table_one.table_one_pk in  (select  table_one.table_one_pk from table_one,table_two
WHERE 
coalesce(table_one.invoice_number,table_two.invoice_number,1) = coalesce(table_two.invoice_number,table_one.invoice_number,1) 
AND
coalesce(table_one.submitted_by,table_two.submitted_by,1) = coalesce(table_two.submitted_by,table_one.submitted_by,1));

Answer 4

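Your current query joins the two tables using a Nested Loop, which means that the server processes

9,661,262 * 299,998 = 2,898,359,277,476

rows. No wonder it takes forever.

To make the join efficient you need an index on all the join columns. The problem is the NULL values.

If you use a function on the join columns, an index usually can't be used.

If you use an expression like this in the JOIN: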

coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')

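the index can't be used.

So, we need an index, and we need to do something about the NULL values to make the index usable.

We don't need to make any changes to table_one, because it has to be scanned in full in any case.

But table_two can definitely be improved. Either change the table itself, or create a separate (temporary) table. It has only 300K rows, so that should not be a problem.

Make all the columns that are used in the JOIN NOT NULL.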

CREATE TABLE table_two (
    id int4 NOT NULL,
    invoice_number varchar(30) NOT NULL,
    submitted_by varchar(20) NOT NULL,
    passport_number varchar(30) NOT NULL,
    driving_license_number varchar(30) NOT NULL,
    national_id_number varchar(30) NOT NULL,
    tax_pin_identification_number varchar(30) NOT NULL,
    vat_number varchar(30) NOT NULL,
    ggcg_number varchar(30) NOT NULL,
    national_association_number varchar(30) NOT NULL,
    column_y int,
    CONSTRAINT table_two_pkey PRIMARY KEY (id)
);

Update the table and replace the NULL values with '' or some other appropriate value.
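If the existing table_two is changed in place rather than recreated, the replacement has to happen before the NOT NULL constraints are added. The following is a sketch only, assuming the nine join columns are text-like and that the empty string never occurs as a real value in any of them:

```sql
-- Replace NULLs in the join columns with the empty string.
UPDATE table_two
SET invoice_number                = COALESCE(invoice_number, ''),
    submitted_by                  = COALESCE(submitted_by, ''),
    passport_number               = COALESCE(passport_number, ''),
    driving_license_number        = COALESCE(driving_license_number, ''),
    national_id_number            = COALESCE(national_id_number, ''),
    tax_pin_identification_number = COALESCE(tax_pin_identification_number, ''),
    vat_number                    = COALESCE(vat_number, ''),
    ggcg_number                   = COALESCE(ggcg_number, ''),
    national_association_number   = COALESCE(national_association_number, '');

-- With no NULLs left, the constraints can be added.
ALTER TABLE table_two
    ALTER COLUMN invoice_number                SET NOT NULL,
    ALTER COLUMN submitted_by                  SET NOT NULL,
    ALTER COLUMN passport_number               SET NOT NULL,
    ALTER COLUMN driving_license_number        SET NOT NULL,
    ALTER COLUMN national_id_number            SET NOT NULL,
    ALTER COLUMN tax_pin_identification_number SET NOT NULL,
    ALTER COLUMN vat_number                    SET NOT NULL,
    ALTER COLUMN ggcg_number                   SET NOT NULL,
    ALTER COLUMN national_association_number   SET NOT NULL;
```

If '' can legitimately appear in the data, a different sentinel value would be needed, for the same reason as the coalesce default discussed earlier in the thread.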

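Create an index on all the columns that are used in the JOIN plus column_y. column_y has to be included last in the index. I assume that your UPDATE is well-formed, so the index should be unique.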

CREATE UNIQUE INDEX IX ON table_two
(
    invoice_number,
    submitted_by,
    passport_number,
    driving_license_number,
    national_id_number,
    tax_pin_identification_number,
    vat_number,
    ggcg_number,
    national_association_number,
    column_y
);

The query will become

UPDATE table_one SET x = table_two.y 
FROM table_two
WHERE 
COALESCE(table_one.invoice_number, '') = table_two.invoice_number AND
COALESCE(table_one.submitted_by, '') = table_two.submitted_by AND
COALESCE(table_one.passport_number, '') = table_two.passport_number AND
COALESCE(table_one.driving_license_number, '') = table_two.driving_license_number AND
COALESCE(table_one.national_id_number, '') = table_two.national_id_number AND
COALESCE(table_one.tax_pin_identification_number, '') = table_two.tax_pin_identification_number AND
COALESCE(table_one.vat_number, '') = table_two.vat_number AND
COALESCE(table_one.ggcg_number, '') = table_two.ggcg_number AND
COALESCE(table_one.national_association_number, '') = table_two.national_association_number

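Note that COALESCE is used only on the table_one columns.

It is also a good idea to do the UPDATE in batches, rather than the whole table at once. For example, pick a range of ids to update in each batch.

UPDATE table_one SET x = table_two.y 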
FROM table_two
WHERE 
table_one.id >= <some_starting_value> AND
table_one.id < <some_ending_value> AND
COALESCE(table_one.invoice_number, '') = table_two.invoice_number AND
COALESCE(table_one.submitted_by, '') = table_two.submitted_by AND
COALESCE(table_one.passport_number, '') = table_two.passport_number AND
COALESCE(table_one.driving_license_number, '') = table_two.driving_license_number AND
COALESCE(table_one.national_id_number, '') = table_two.national_id_number AND
COALESCE(table_one.tax_pin_identification_number, '') = table_two.tax_pin_identification_number AND
COALESCE(table_one.vat_number, '') = table_two.vat_number AND
COALESCE(table_one.ggcg_number, '') = table_two.ggcg_number AND
COALESCE(table_one.national_association_number, '') = table_two.national_association_number

Answer 5

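You can use the coalesce function. Note that it returns its first non-null argument rather than true or false, so comparing the coalesced values on both sides is what makes the null check work. Null-handling functions like this can help you.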

Null-related functions here.
