I have two tables.
Sample tables:
CREATE TABLE [dbo].[existing_bacteria](
[bacteria_name] [nchar](10) NULL,
[bacteria_type] [nchar](10) NULL,
[bacteria_sub_type] [nchar](10) NULL,
[bacteria_size] [nchar](10) NULL,
[bacteria_family] [nchar](10) NULL,
[bacteria_discovery_year] [date] NOT NULL
)
CREATE TABLE [dbo].[new_bacteria](
[existing_bacteria_name] [nchar](10) NULL,
[bacteria_type] [nchar](10) NULL,
[bacteria_sub_type] [nchar](10) NULL,
[bacteria_size] [nchar](10) NULL,
[bacteria_family] [nchar](10) NULL,
[bacteria_discovery_year] [date] NOT NULL
)
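I need to create a stored procedure that updates the new_bacteria table with a possible match from existing_bacteria: it should set new_bacteria.existing_bacteria_name by finding a match on the other fields of [existing_bacteria] (assume there is only one matching record in existing_bacteria).
Since the tables are large (millions of records each), I would like your opinion on how to approach this. Here is what I have so far:
Solution 1:
The obvious solution is to fetch everything into a cursor, iterate over the results, and update new_bacteria.
But with millions of records this is not the best solution.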
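-- cursor sketch: one lookup in existing_bacteria per row of new_bacteria
DECLARE @bacteria_type nchar(10), @bacteria_size nchar(10), @bacteria_name nchar(10);
DECLARE db_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT [bacteria_type], [bacteria_size] FROM [dbo].[new_bacteria];
OPEN db_cursor;
FETCH NEXT FROM db_cursor INTO @bacteria_type, @bacteria_size;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- reset, so a miss does not reuse the previous row's name
    SET @bacteria_name = NULL;
    SELECT @bacteria_name = [bacteria_name]
    FROM [dbo].[existing_bacteria]
    WHERE [bacteria_type] = @bacteria_type AND [bacteria_size] = @bacteria_size;
    IF @bacteria_name IS NOT NULL
    BEGIN
        PRINT 'update new_bacteria.existing_bacteria_name with [bacteria_name] we found.';
    END
    -- go to next record
    FETCH NEXT FROM db_cursor INTO @bacteria_type, @bacteria_size;
END
CLOSE db_cursor;
DEALLOCATE db_cursor;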
Solution 2:
Solution 2 is to join the two tables in a SQL Server procedure and iterate over the results:
-- pseudo code
select * from [new_bacteria]
inner join [existing_bacteria]
on [new_bacteria].bacteria_size = [existing_bacteria].bacteria_size
and [new_bacteria].bacteria_family = [existing_bacteria].bacteria_family
-- then, for each result row: update [new_bacteria].existing_bacteria_name
I am sure this is not optimal either, because of the table size and the iteration.
Solution 3:
Solution 3 is to let the database handle the data and update the table directly with an inner join:
-- pseudo code
UPDATE R
SET R.existing_bacteria_name = p.[bacteria_name]
FROM [new_bacteria] AS R
inner join [existing_bacteria] P
on R.bacteria_size = P.bacteria_size
and R.bacteria_family = P.bacteria_family
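I am not sure about this solution.
Answer 0 (score: 2)
Based on your pseudo code, I would go with solution 3, because it is a set-based operation and should be much faster than a cursor or any other loop.
If you run into performance problems with solution 3 and you have no indexes on these tables, especially on the columns used to join the two tables, creating those indexes would help.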
create unique index uix_new_bacteria_bacteria_size_bacteria_family
on [new_bacteria] (bacteria_size,bacteria_family);
create unique index uix_existing_bacteria_bacteria_size_bacteria_family
on [existing_bacteria] (bacteria_size,bacteria_family) include (bacteria_name);
Then try:
update r
set r.existing_bacteria_name = p.[bacteria_name]
from [new_bacteria] AS R
inner join [existing_bacteria] P on R.bacteria_size = P.bacteria_size
and R.bacteria_family = P.bacteria_family;
Updating a few million rows should not be a problem with proper indexes.
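Since the question asks for a stored procedure, the update above could be wrapped roughly as follows. This is only a minimal sketch: the procedure name usp_match_new_bacteria is hypothetical, and the WHERE filter is an added assumption (it is not part of the statement above) that skips rows already holding the right name, so a rerun writes less. Note that with this filter a row whose match has a NULL bacteria_name is left untouched.

```sql
-- Hypothetical procedure name; the join columns are the ones used in solution 3.
CREATE PROCEDURE [dbo].[usp_match_new_bacteria]
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE r
    SET r.existing_bacteria_name = p.bacteria_name
    FROM [dbo].[new_bacteria] AS r
    INNER JOIN [dbo].[existing_bacteria] AS p
        ON  r.bacteria_size   = p.bacteria_size
        AND r.bacteria_family = p.bacteria_family
    -- Assumption: only touch rows that are not already up to date.
    WHERE r.existing_bacteria_name IS NULL
       OR r.existing_bacteria_name <> p.bacteria_name;

    -- Report how many rows the UPDATE changed.
    SELECT @@ROWCOUNT AS rows_updated;
END;
```

It would then be run with EXEC [dbo].[usp_match_new_bacteria];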
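This section is no longer relevant after the question was updated:
If bacteria_size and bacteria_family are not a unique set, there may be another problem: you could get multiple matches.
(Since they are nullable, I assume they are not unique unless you use a filtered index.)
In that case, before going any further, I would create a table to investigate those multiple matches: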
create table [dbo].[new_and_existing_bacteria_matches](
[existing_bacteria_name] [nchar](10) not null,
rn int not null,
[bacteria_type] [nchar](10) null,
[bacteria_sub_type] [nchar](10) null,
[bacteria_size] [nchar](10) null,
[bacteria_family] [nchar](10) null,
[bacteria_discovery_year] [date] not null,
constraint pk_new_and_existing primary key clustered ([existing_bacteria_name], rn)
);
insert into [new_and_existing_bacteria_matches]
([existing_bacteria_name],rn,[bacteria_type],[bacteria_sub_type],[bacteria_size],[bacteria_family],[bacteria_discovery_year])
select
e.[bacteria_name] -- the name column in existing_bacteria is [bacteria_name]
, rn = row_number() over (partition by e.[bacteria_name] order by n.[bacteria_type], n.[bacteria_sub_type])
, n.[bacteria_type]
, n.[bacteria_sub_type]
, n.[bacteria_size]
, n.[bacteria_family]
, n.[bacteria_discovery_year]
from [new_bacteria] as n
inner join [existing_bacteria] e on n.bacteria_size = e.bacteria_size
and n.bacteria_family = e.bacteria_family;
-- and query multiple matches with something like this:
select *
from [new_and_existing_bacteria_matches] n
where exists (
select 1
from [new_and_existing_bacteria_matches] i
where i.[existing_bacteria_name]=n.[existing_bacteria_name]
and rn>1
);
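A lighter first check, before building the investigation table, might be to count duplicate join keys directly in existing_bacteria. This is a sketch that assumes the match is on bacteria_size and bacteria_family as above; the alias match_candidates is made up for illustration.

```sql
-- List every (size, family) pair that occurs more than once in existing_bacteria.
-- Any row returned here means the join can produce multiple matches for one new_bacteria row.
SELECT e.bacteria_size,
       e.bacteria_family,
       COUNT(*) AS match_candidates
FROM [dbo].[existing_bacteria] AS e
GROUP BY e.bacteria_size, e.bacteria_family
HAVING COUNT(*) > 1
ORDER BY match_candidates DESC;
```

Keep in mind that GROUP BY puts NULLs into one group, while the equality join never matches NULL keys, so groups with a NULL key do not cause duplicate matches in the update itself.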
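Answer 1 (score: 2)
On the subject of performance, I would look at:
For anyone who has to do bulk inserts or large-scale updates, Microsoft's white paper is valuable reading:
You could try a MERGE statement. It performs the operation in a single pass over the data. (The problem with MERGE is that it tries to do everything in one transaction, and you may end up with an unwanted spool in the execution plan. In that case I would move to a batch process that loops over perhaps 100,000 records at a time.)
(It needs some minor changes to meet your column matching/update requirements.)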
MERGE [dbo].[new_bacteria] T --TARGET TABLE
USING [dbo].[existing_bacteria] S --SOURCE TABLE
ON
S.[bacteria_name] = T.[existing_bacteria_name] --FIELDS TO MATCH ON
AND S.[bacteria_type] = T.[bacteria_type]
WHEN MATCHED
AND (
ISNULL(T.[bacteria_sub_type],'') <> ISNULL(S.[bacteria_sub_type],'') --FIELDS WHERE YOU'RE LOOKING FOR A CHANGE
OR ISNULL(T.[bacteria_size],'') <> ISNULL(S.[bacteria_size],'')
)
THEN --UPDATE RECORDS THAT HAVE CHANGED
UPDATE
SET T.[bacteria_sub_type] = S.[bacteria_sub_type]
WHEN NOT MATCHED BY TARGET THEN --ANY NEW RECORDS IN THE SOURCE TABLE WILL BE INSERTED
INSERT(
[existing_bacteria_name],
[bacteria_type],
[bacteria_sub_type],
[bacteria_size],
[bacteria_family],
[bacteria_discovery_year]
)
VALUES(
s.[bacteria_name],
s.[bacteria_type],
s.[bacteria_sub_type],
s.[bacteria_size],
s.[bacteria_family],
s.[bacteria_discovery_year]
);
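As one possible reading of those "minor changes", a MERGE that only fills in existing_bacteria_name, matching on the same columns as solution 3 in the question, might look like this. It is a sketch under those assumptions, not a tested drop-in; it has no INSERT branch because the question only asks for an update.

```sql
-- Assumption: match on size and family, as in the question's solution 3.
MERGE [dbo].[new_bacteria] AS T                 -- target table
USING [dbo].[existing_bacteria] AS S            -- source table
    ON  S.[bacteria_size]   = T.[bacteria_size]
    AND S.[bacteria_family] = T.[bacteria_family]
-- Only update rows whose name is missing or different.
WHEN MATCHED AND (T.[existing_bacteria_name] IS NULL
               OR T.[existing_bacteria_name] <> S.[bacteria_name])
THEN
    UPDATE SET T.[existing_bacteria_name] = S.[bacteria_name];
```

This relies on the question's assumption that each new_bacteria row has at most one match: if several source rows match the same target row, MERGE raises an error instead of picking one.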
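If a single MERGE is too much for your system, you can embed it in a loop that updates in large batches, as shown below. You can adjust the batch size to match your server's capacity.
It uses a couple of temporary holding tables so that if anything goes wrong (e.g. a server agent restart), the process can pick up where it left off. (If you have any questions, please ask.)
--CAPTURE WHAT HAS CHANGED SINCE THE LAST TIME THE SP WAS RUN
--EXCEPT is a useful command because it can compare NULLs, this removes the need for ISNULL or COALESCE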
INSERT INTO [dbo].[existing_bacteria_changes]
SELECT
*
FROM
[dbo].[existing_bacteria]
EXCEPT
SELECT
*
FROM
[dbo].[new_bacteria]
--RUN FROM THIS POINT IN THE EVENT OF A FAILURE
DECLARE @R INT = 1
DECLARE @Batch INT = 100000
WHILE @R > 0
BEGIN
BEGIN TRAN --CARRY OUT A TRANSACTION WITH A SUBSET OF DATA
--USE DELETE WITH OUTPUT TO MOVE A BATCH OF RECORDS INTO A HOLDING AREA.
--The holding area will provide a rollback point so if the job fails at any point it will restart from where it last was.
DELETE TOP (@Batch)
FROM [dbo].[existing_bacteria_changes]
OUTPUT DELETED.* INTO [dbo].[existing_bacteria_Batch];
--LOG THE NUMBER OF RECORDS IN THE UPDATE SET, THIS DECIDES WHETHER THERE IS A NEXT ITERATION
--(@@ROWCOUNT MUST BE READ IMMEDIATELY AFTER THE DELETE)
SET @R = @@ROWCOUNT;
--RUN THE MERGE STATEMENT WITH THE SUBSET OF UPDATES
MERGE [dbo].[new_bacteria] T --TARGET TABLE
USING [dbo].[existing_bacteria_Batch] S --SOURCE TABLE
ON
S.[bacteria_name] = T.[existing_bacteria_name] --FIELDS TO MATCH ON
AND S.[bacteria_type] = T.[bacteria_type]
WHEN MATCHED
AND (
ISNULL(T.[bacteria_sub_type],'') <> ISNULL(S.[bacteria_sub_type],'') --FIELDS WHERE YOU'RE LOOKING FOR A CHANGE
OR ISNULL(T.[bacteria_size],'') <> ISNULL(S.[bacteria_size],'')
)
THEN --UPDATE RECORDS THAT HAVE CHANGED
UPDATE
SET T.[bacteria_sub_type] = S.[bacteria_sub_type]
WHEN NOT MATCHED BY TARGET THEN --ANY NEW RECORDS IN THE SOURCE TABLE WILL BE INSERTED
INSERT(
[existing_bacteria_name],
[bacteria_type],
[bacteria_sub_type],
[bacteria_size],
[bacteria_family],
[bacteria_discovery_year]
)
VALUES(
s.[bacteria_name],
s.[bacteria_type],
s.[bacteria_sub_type],
s.[bacteria_size],
s.[bacteria_family],
s.[bacteria_discovery_year]
);
COMMIT;
--No point in logging this action
TRUNCATE TABLE [dbo].[existing_bacteria_Batch];
END
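Answer 2 (score: 1)
Definitely option 3. Set-based always beats any kind of loop.
That said, the biggest 'risk' is that the amount of data being updated may 'overwhelm' your machine. More specifically, the transaction can become so large that the system takes forever to finish it. To avoid this, you could try splitting the one big UPDATE into several smaller UPDATEs that are each still set-based. Good indexing and knowing your data are key.
For example, starting from
UPDATE R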
SET R.existing_bacteria_name = p.[bacteria_name]
FROM [new_bacteria] AS R
INNER JOIN [existing_bacteria] P
ON R.bacteria_size = P.bacteria_size
AND R.bacteria_family = P.bacteria_family
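you could try to 'chunk' the (target) table into smaller parts, e.g. by looping over the bacteria_discovery_year field, assuming that column splits the table into, say, 50 parts of more or less equal size. (By the way: I'm no biologist, so I might be totally wrong about that =)
Then you would get something like this: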
DECLARE @c_bacteria_discovery_year date
DECLARE year_loop CURSOR LOCAL STATIC
FOR SELECT DISTINCT bacteria_discovery_year
FROM [new_bacteria]
ORDER BY bacteria_discovery_year
OPEN year_loop
FETCH NEXT FROM year_loop INTO @c_bacteria_discovery_year
WHILE @@FETCH_STATUS = 0
BEGIN
UPDATE R
SET R.existing_bacteria_name = p.[bacteria_name]
FROM [new_bacteria] AS R
INNER JOIN [existing_bacteria] P
ON R.bacteria_size = P.bacteria_size
AND R.bacteria_family = P.bacteria_family
WHERE R.bacteria_discovery_year = @c_bacteria_discovery_year
FETCH NEXT FROM year_loop INTO @c_bacteria_discovery_year
END
CLOSE year_loop
DEALLOCATE year_loop
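Some remarks:
- This depends on the distribution of the bacteria_discovery_year values; if 3 years make up 95% of the data, it might not be a great choice.
- This only pays off when there is an index on the bacteria_discovery_year column, preferably including bacteria_size and bacteria_family.
- You could add a PRINT inside the loop to see the progress and the rows affected... it won't speed anything up, but it feels better if you know it is doing something =)
PS: in any case, you will also need an index on the 'source' table covering the bacteria_size and bacteria_family columns, preferably including bacteria_name if the latter is not the (clustered) PK of the table.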
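To make the index and progress remarks concrete, here is a rough sketch of the same loop with a supporting index and a progress message per year. The index name ix_new_bacteria_discovery_year and the message text are made up for illustration, and RAISERROR ... WITH NOWAIT at severity 0 is used in place of PRINT only because it sends the message immediately rather than when the output buffer fills.

```sql
-- Hypothetical index name; supports the per-year filter and covers the join columns.
CREATE INDEX ix_new_bacteria_discovery_year
    ON [dbo].[new_bacteria] (bacteria_discovery_year)
    INCLUDE (bacteria_size, bacteria_family);

DECLARE @c_bacteria_discovery_year date,
        @rows int,
        @year_text char(10);

DECLARE year_loop CURSOR LOCAL STATIC
    FOR SELECT DISTINCT bacteria_discovery_year
        FROM [dbo].[new_bacteria]
        ORDER BY bacteria_discovery_year;

OPEN year_loop;
FETCH NEXT FROM year_loop INTO @c_bacteria_discovery_year;

WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE R
    SET R.existing_bacteria_name = P.[bacteria_name]
    FROM [dbo].[new_bacteria] AS R
    INNER JOIN [dbo].[existing_bacteria] AS P
        ON  R.bacteria_size   = P.bacteria_size
        AND R.bacteria_family = P.bacteria_family
    WHERE R.bacteria_discovery_year = @c_bacteria_discovery_year;

    -- Capture the row count right away; the next statement would reset it.
    SET @rows = @@ROWCOUNT;
    SET @year_text = CONVERT(char(10), @c_bacteria_discovery_year, 120);

    -- Severity 0 is informational; NOWAIT flushes the message immediately.
    RAISERROR(N'discovery year %s: %d rows updated', 0, 1, @year_text, @rows) WITH NOWAIT;

    FETCH NEXT FROM year_loop INTO @c_bacteria_discovery_year;
END

CLOSE year_loop;
DEALLOCATE year_loop;
```

Each UPDATE here runs as its own autocommit transaction, which is what keeps the individual transactions small.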