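I am updating a column on one table using data from another table. The WHERE clause is based on multiple columns, and some of those columns are null. My thinking is that these nulls are what is throwing off the standard UPDATE TABLE SET X=Y WHERE A=B statement.
See this SQL Fiddle of the two tables, where I am trying to update table_one based on data from table_two.
My query currently looks like this: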
UPDATE table_one SET table_one.x = table_two.y
FROM table_two
WHERE
table_one.invoice_number = table_two.invoice_number AND
table_one.submitted_by = table_two.submitted_by AND
table_one.passport_number = table_two.passport_number AND
table_one.driving_license_number = table_two.driving_license_number AND
table_one.national_id_number = table_two.national_id_number AND
table_one.tax_pin_identification_number = table_two.tax_pin_identification_number AND
table_one.vat_number = table_two.vat_number AND
table_one.ggcg_number = table_two.ggcg_number AND
table_one.national_association_number = table_two.national_association_number
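The query fails for some rows: table_one.x is not updated when any of the columns in either table is null. That is, it only gets updated when all the columns contain some data.
This question is related to my earlier one here on SO, where I was getting distinct values from a large data set using Distinct On. What I want now is to populate the large data set with values from the table that has the unique fields.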
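UPDATE
I used the first update statement provided by @binotenary. For small tables, it runs in a flash. For example, one table with 20,000 records was updated in about 20 seconds. But another table with 9 million records has been running for 20 hours so far! See the output of EXPLAIN below:
Update on table_one  (cost=0.00..210634237338.87 rows=13615011125 width=1996)
  ->  Nested Loop  (cost=0.00..210634237338.87 rows=13615011125 width=1996)
        Join Filter: ((((my_update_statement_here))))
        ->  Seq Scan on table_one  (cost=0.00..610872.62 rows=9661262 width=1986)
        ->  Seq Scan on table_two  (cost=0.00..6051.98 rows=299998 width=148)
The EXPLAIN ANALYZE option also took forever, so I cancelled it.
Any ideas on how to make this type of update faster? Even if it means using a different update statement, or even using a custom function to loop through and do the update.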
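Answer 0 (score: 8)
Since null = null does not evaluate to true (the result is null, which WHERE treats like false), you need to check whether both fields are null in addition to the equality check: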
-- the SET target must not be table-qualified in PostgreSQL
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
(table_one.invoice_number = table_two.invoice_number
OR (table_one.invoice_number is null AND table_two.invoice_number is null))
AND
(table_one.submitted_by = table_two.submitted_by
OR (table_one.submitted_by is null AND table_two.submitted_by is null))
AND
-- etc
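PostgreSQL also has a null-safe comparison, IS NOT DISTINCT FROM, which treats two nulls as equal. A minimal sketch of the same update written that way follows. Only the first two columns are shown and the rest would follow the same pattern. It is mainly a readability alternative; I would not assume it changes the query plan.

```sql
-- Plain equality yields null when either side is null;
-- the null-safe form always yields true or false.
SELECT NULL = NULL                    AS plain_equality,  -- null
       NULL IS NOT DISTINCT FROM NULL AS null_safe;       -- true

-- The same update with null-safe comparisons (remaining columns omitted).
UPDATE table_one
SET x = table_two.y
FROM table_two
WHERE table_one.invoice_number IS NOT DISTINCT FROM table_two.invoice_number
  AND table_one.submitted_by   IS NOT DISTINCT FROM table_two.submitted_by;
```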
You can also use the more readable coalesce function:
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')
AND coalesce(table_one.submitted_by, '') = coalesce(table_two.submitted_by, '')
AND -- etc
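However, you need to be careful with the default value (the last argument of coalesce).
Its data type should match the column type (so that you don't end up comparing dates with numbers, for example), and the default should be a value that does not appear in the data.
E.g. coalesce(null, 1) = coalesce(1, 1)
is a situation you want to avoid.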
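To make the pitfall concrete, here is a small sketch with literal values. It assumes a hypothetical integer column in which 1 is a real value and -1 never occurs:

```sql
-- The default 1 collides with real data, so a null "matches" the value 1.
-- The default -1 (assumed never to appear in the data) does not.
SELECT coalesce(NULL::int, 1)  = coalesce(1, 1)  AS colliding_default,  -- true
       coalesce(NULL::int, -1) = coalesce(1, -1) AS unused_default;     -- false
```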
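Seq Scan on table_two
- this suggests that you don't have any indexes on table_two.
So when a row in table_one is updated, the database has to scan the rows of table_two one by one until it finds the matching row.
The matching rows could be found much faster if the relevant columns were indexed.
On the other hand, if table_one has any indexes, they slow the update down.
According to this guide:
Table constraints and indexes heavily delay every write. If possible, you should drop all indexes, triggers and foreign keys while the update runs and recreate them at the end.
Another suggestion from the same guide that might be helpful is:
If you can segment your data using, for example, sequential IDs, you can update rows incrementally in batches.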
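For example, if table_one has an id column, you could add something like
and table_one.id between x and y
to the where condition and run the query several times, changing the values of x and y so that all rows are covered.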
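One way to avoid changing x and y by hand is to let a PL/pgSQL DO block walk through the id range. This is only a sketch under several assumptions: table_one has a numeric id column, the batch size of 100000 is an arbitrary choice, and only two of the join columns are shown. The COMMIT inside the loop needs PostgreSQL 11 or later, and the block must not be run inside an explicit transaction.

```sql
DO $$
DECLARE
    batch_size  constant bigint := 100000;  -- assumed value; tune for your data
    batch_start bigint;
    max_id      bigint;
BEGIN
    SELECT min(id), max(id) INTO batch_start, max_id FROM table_one;

    -- If table_one is empty, both values are null and the loop body never runs.
    WHILE batch_start <= max_id LOOP
        UPDATE table_one
        SET x = table_two.y
        FROM table_two
        WHERE table_one.id >= batch_start
          AND table_one.id <  batch_start + batch_size
          -- IS NOT DISTINCT FROM is a null-safe equality; repeat for the other columns
          AND table_one.invoice_number IS NOT DISTINCT FROM table_two.invoice_number
          AND table_one.submitted_by   IS NOT DISTINCT FROM table_two.submitted_by;

        COMMIT;  -- PostgreSQL 11+ only; remove this line on older versions
        batch_start := batch_start + batch_size;
    END LOOP;
END
$$;
```

Without the COMMIT the whole loop runs as a single transaction, so on older versions it may be simpler to drive the batches from a client script instead.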
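The EXPLAIN ANALYZE option also took forever
You might want to be careful when using the ANALYZE option with EXPLAIN on statements that have side effects.
According to the documentation:
Keep in mind that the statement is actually executed when the ANALYZE option is used. Although EXPLAIN will discard any output that a SELECT would return, other side effects of the statement will happen as usual.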
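If you do want the real timings without keeping the changes, the usual pattern is to wrap the statement in a transaction and roll it back. A minimal sketch, with only the first join condition shown:

```sql
BEGIN;

EXPLAIN ANALYZE
UPDATE table_one
SET x = table_two.y
FROM table_two
WHERE table_one.invoice_number = table_two.invoice_number;  -- remaining conditions omitted

-- Discard the changes made by the update; the timings above are still reported.
ROLLBACK;
```

Note that the update still does all of its work before the rollback, so this takes as long as running the statement itself.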
Answer 1 (score: 3)
Try the query below. It is similar to the one from @binoternary above, who just beat me to the answer.
update table_one
set column_x = (select column_y from table_two
where
(( table_two.invoice_number = table_one.invoice_number)OR (table_two.invoice_number IS NULL AND table_one.invoice_number IS NULL))
and ((table_two.submitted_by=table_one.submitted_by)OR (table_two.submitted_by IS NULL AND table_one.submitted_by IS NULL))
and ((table_two.passport_number=table_one.passport_number)OR (table_two.passport_number IS NULL AND table_one.passport_number IS NULL))
and ((table_two.driving_license_number=table_one.driving_license_number)OR (table_two.driving_license_number IS NULL AND table_one.driving_license_number IS NULL))
and ((table_two.national_id_number=table_one.national_id_number)OR (table_two.national_id_number IS NULL AND table_one.national_id_number IS NULL))
and ((table_two.tax_pin_identification_number=table_one.tax_pin_identification_number)OR (table_two.tax_pin_identification_number IS NULL AND table_one.tax_pin_identification_number IS NULL))
and ((table_two.vat_number=table_one.vat_number)OR (table_two.vat_number IS NULL AND table_one.vat_number IS NULL))
and ((table_two.ggcg_number=table_one.ggcg_number)OR (table_two.ggcg_number IS NULL AND table_one.ggcg_number IS NULL))
and ((table_two.national_association_number=table_one.national_association_number)OR (table_two.national_association_number IS NULL AND table_one.national_association_number IS NULL))
);
Answer 2 (score: 1)
You can use a null-check function like Oracle's NVL.
For Postgres, you will have to use coalesce.
i.e. your query could look like this:
UPDATE table_one SET table_one.x =(select table_two.y from table_one,table_two
WHERE
coalesce(table_one.invoice_number,table_two.invoice_number,1) = coalesce(table_two.invoice_number,table_one.invoice_number,1)
AND
coalesce(table_one.submitted_by,table_two.submitted_by,1) = coalesce(table_two.submitted_by,table_one.submitted_by,1))
where table_one.table_one_pk in (select table_one.table_one_pk from table_one,table_two
WHERE
coalesce(table_one.invoice_number,table_two.invoice_number,1) = coalesce(table_two.invoice_number,table_one.invoice_number,1)
AND
coalesce(table_one.submitted_by,table_two.submitted_by,1) = coalesce(table_two.submitted_by,table_one.submitted_by,1));
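Answer 3 (score: 1)
Your current query joins the two tables using a Nested Loop, which means that the server processes
9,661,262 * 299,998 = 2,898,359,277,476
rows. No wonder it takes forever.
To make the join efficient you need an index on all the joined columns. The problem is the NULL values.
If you use a function on the joined columns, the index generally can't be used.
If you use an expression like this in the JOIN:
coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')
an index can't be used.
So we need an index, and we need to do something about the NULL values to make the index usable.
We don't need to make any changes to table_one, because it has to be scanned in full in any case.
But table_two can definitely be improved. Either change the table itself or create a separate (temporary) table. It has only 300K rows, so that should not be a problem.
Make all the columns that are used in the JOIN NOT NULL.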
CREATE TABLE table_two (
id int4 NOT NULL,
invoice_number varchar(30) NOT NULL,
submitted_by varchar(20) NOT NULL,
passport_number varchar(30) NOT NULL,
driving_license_number varchar(30) NOT NULL,
national_id_number varchar(30) NOT NULL,
tax_pin_identification_number varchar(30) NOT NULL,
vat_number varchar(30) NOT NULL,
ggcg_number varchar(30) NOT NULL,
national_association_number varchar(30) NOT NULL,
column_y int,
CONSTRAINT table_two_pkey PRIMARY KEY (id)
);
Update the table and replace the NULL values with '' or some other appropriate value.
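If you change the existing table_two in place rather than recreating it, that step might look roughly like the sketch below. It assumes that '' never occurs as a real value in any of these columns:

```sql
-- Replace nulls with the empty string in every join column.
UPDATE table_two
SET invoice_number                = COALESCE(invoice_number, ''),
    submitted_by                  = COALESCE(submitted_by, ''),
    passport_number               = COALESCE(passport_number, ''),
    driving_license_number        = COALESCE(driving_license_number, ''),
    national_id_number            = COALESCE(national_id_number, ''),
    tax_pin_identification_number = COALESCE(tax_pin_identification_number, ''),
    vat_number                    = COALESCE(vat_number, ''),
    ggcg_number                   = COALESCE(ggcg_number, ''),
    national_association_number   = COALESCE(national_association_number, '');

-- Then forbid nulls from now on.
ALTER TABLE table_two
    ALTER COLUMN invoice_number                SET NOT NULL,
    ALTER COLUMN submitted_by                  SET NOT NULL,
    ALTER COLUMN passport_number               SET NOT NULL,
    ALTER COLUMN driving_license_number        SET NOT NULL,
    ALTER COLUMN national_id_number            SET NOT NULL,
    ALTER COLUMN tax_pin_identification_number SET NOT NULL,
    ALTER COLUMN vat_number                    SET NOT NULL,
    ALTER COLUMN ggcg_number                   SET NOT NULL,
    ALTER COLUMN national_association_number   SET NOT NULL;
```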
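Create an index on all the columns that are used in the JOIN plus column_y. column_y has to be included last in the index. I assume that your UPDATE is well-formed, so the index should be unique.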
CREATE UNIQUE INDEX IX ON table_two
(
invoice_number,
submitted_by,
passport_number,
driving_license_number,
national_id_number,
tax_pin_identification_number,
vat_number,
ggcg_number,
national_association_number,
column_y
);
The query becomes
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
COALESCE(table_one.invoice_number, '') = table_two.invoice_number AND
COALESCE(table_one.submitted_by, '') = table_two.submitted_by AND
COALESCE(table_one.passport_number, '') = table_two.passport_number AND
COALESCE(table_one.driving_license_number, '') = table_two.driving_license_number AND
COALESCE(table_one.national_id_number, '') = table_two.national_id_number AND
COALESCE(table_one.tax_pin_identification_number, '') = table_two.tax_pin_identification_number AND
COALESCE(table_one.vat_number, '') = table_two.vat_number AND
COALESCE(table_one.ggcg_number, '') = table_two.ggcg_number AND
COALESCE(table_one.national_association_number, '') = table_two.national_association_number
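Note that COALESCE is used only on the table_one columns.
It is also a good idea to do the UPDATE in batches rather than the whole table at once. For example, pick a range of ids to update in each batch.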
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
table_one.id >= <some_starting_value> AND
table_one.id < <some_ending_value> AND
COALESCE(table_one.invoice_number, '') = table_two.invoice_number AND
COALESCE(table_one.submitted_by, '') = table_two.submitted_by AND
COALESCE(table_one.passport_number, '') = table_two.passport_number AND
COALESCE(table_one.driving_license_number, '') = table_two.driving_license_number AND
COALESCE(table_one.national_id_number, '') = table_two.national_id_number AND
COALESCE(table_one.tax_pin_identification_number, '') = table_two.tax_pin_identification_number AND
COALESCE(table_one.vat_number, '') = table_two.vat_number AND
COALESCE(table_one.ggcg_number, '') = table_two.ggcg_number AND
COALESCE(table_one.national_association_number, '') = table_two.national_association_number
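Answer 4 (score: 0)
You can use the coalesce function, which returns its first argument that is not null, so a null can be replaced with a default value before comparing. A null-check function like this can help you.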