MSSQL procedure - performance considerations when updating a massive table

Time: 2016-12-13 18:07:12

Tags: sql-server stored-procedures

I have two tables:

  1. existing_bacteria (may contain several million rows)
  2. new_bacteria (may contain several million rows)
  3. Sample tables:

      CREATE TABLE [dbo].[existing_bacteria](
        [bacteria_name] [nchar](10) NULL,
        [bacteria_type] [nchar](10) NULL,
        [bacteria_sub_type] [nchar](10) NULL,
        [bacteria_size] [nchar](10) NULL,
        [bacteria_family] [nchar](10) NULL,
        [bacteria_discovery_year] [date] NOT NULL
      );

      CREATE TABLE [dbo].[new_bacteria](
        [existing_bacteria_name] [nchar](10) NULL,
        [bacteria_type] [nchar](10) NULL,
        [bacteria_sub_type] [nchar](10) NULL,
        [bacteria_size] [nchar](10) NULL,
        [bacteria_family] [nchar](10) NULL,
        [bacteria_discovery_year] [date] NOT NULL
      );
    

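    I need to create a stored procedure that updates the new_bacteria table with a possible match from existing_bacteria: set the field new_bacteria.existing_bacteria_name by finding a match on the other fields of [existing_bacteria] (assume there is only one matching record in existing_bacteria).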

    Since the tables are large (several million records each), I would like your opinion on how to approach this. Here is what I have so far:

    Solution 1:

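    The obvious solution is to fetch everything into a cursor, iterate over the results, and update new_bacteria.
    But with millions of records, this is not the optimal solution.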

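    -- Row-by-row cursor version (for comparison only; slow on millions of rows)
    DECLARE @bacteria_type nchar(10),
            @bacteria_size nchar(10),
            @bacteria_name nchar(10);

    DECLARE db_cursor CURSOR LOCAL FORWARD_ONLY DYNAMIC FOR
        SELECT [bacteria_type], [bacteria_size]
          FROM [dbo].[new_bacteria]
        FOR UPDATE OF [existing_bacteria_name];

    OPEN db_cursor;
    FETCH NEXT FROM db_cursor INTO @bacteria_type, @bacteria_size;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @bacteria_name = NULL;

        -- look for a match in existing_bacteria
        SELECT @bacteria_name = [bacteria_name]
          FROM [dbo].[existing_bacteria]
         WHERE [bacteria_type] = @bacteria_type
           AND [bacteria_size] = @bacteria_size;

        IF @bacteria_name IS NOT NULL
        BEGIN
            -- update new_bacteria.existing_bacteria_name with the [bacteria_name] we found
            UPDATE [dbo].[new_bacteria]
               SET [existing_bacteria_name] = @bacteria_name
             WHERE CURRENT OF db_cursor;
        END

        -- go to next record
        FETCH NEXT FROM db_cursor INTO @bacteria_type, @bacteria_size;
    END

    CLOSE db_cursor;
    DEALLOCATE db_cursor;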
    

    Solution 2:

    Solution 2 is to join the two tables inside the MSSQL procedure and iterate over the results:

     -- pseudo code
        select * from [new_bacteria] 
        inner join [existing_bacteria]
            on [new_bacteria].bacteria_size = [existing_bacteria].bacteria_size
            and [new_bacteria].bacteria_family = [existing_bacteria].bacteria_family
    
        for each result update [new_bacteria]
    

    I am sure this is not optimal either, because of the table sizes and the iteration.

    Solution 3:

    Solution 3 is to let the database handle the data and use an inner join:

    Update the table directly:
     -- pseudo code
    UPDATE R 
    SET R.existing_bacteria_name = P.[bacteria_name]
    FROM [new_bacteria] AS R
     inner join [existing_bacteria] AS P
            on R.bacteria_size = P.bacteria_size
            and R.bacteria_family = P.bacteria_family
    

    I am not sure about this solution.

3 answers:

Answer 0: (score: 2)

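Based on your pseudo code, I would go with solution 3, because it is a set-based operation and should be much faster than using a cursor or any other loop.

If you run into performance problems with solution 3...

and you have no indexes on these tables, especially on the columns used to join the two tables, then creating those indexes would help.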

create unique index uix_new_bacteria_bacteria_size_bacteria_family 
  on [new_bacteria] (bacteria_size,bacteria_family);

create unique index uix_existing_bacteria_bacteria_size_bacteria_family 
  on [existing_bacteria] (bacteria_size,bacteria_family) include (bacteria_name);

Then try:

update R
    set R.existing_bacteria_name = P.[bacteria_name]
  from [new_bacteria] as R
    inner join [existing_bacteria] as P on R.bacteria_size = P.bacteria_size
      and R.bacteria_family = P.bacteria_family;

Updating a few million rows should not be a problem with proper indexes.

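After the question was updated, this part is no longer relevant:

If bacteria_size and bacteria_family are not a unique set, another issue may exist: you may have multiple matches. (Since they are nullable, I would think they are not unique unless you are using a filtered index.)

In that case, before moving forward, I would create a table to investigate the multiple matches, like this: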

create table [dbo].[new_and_existing_bacteria_matches](
    [existing_bacteria_name] [nchar](10) not null,
    rn int not null,
    [bacteria_type] [nchar](10) null,
    [bacteria_sub_type] [nchar](10) null,
    [bacteria_size] [nchar](10) null,
    [bacteria_family] [nchar](10) null,
    [bacteria_discovery_year] [date] not null,
    constraint pk_new_and_existing primary key clustered ([existing_bacteria_name], rn)
);

insert into [new_and_existing_bacteria_matches]
  ([existing_bacteria_name],rn,[bacteria_type],[bacteria_sub_type],[bacteria_size],[bacteria_family],[bacteria_discovery_year])

select 
    e.[bacteria_name]
  , rn = row_number() over (partition by e.[bacteria_name] order by n.[bacteria_type], n.[bacteria_sub_type])
  , n.[bacteria_type]
  , n.[bacteria_sub_type]
  , n.[bacteria_size]
  , n.[bacteria_family]
  , n.[bacteria_discovery_year]
from [new_bacteria] as n
  inner join [existing_bacteria] e on n.bacteria_size = e.bacteria_size
    and n.bacteria_family = e.bacteria_family;

-- and query multiple matches with something like this:
select * 
  from [new_and_existing_bacteria_matches] n
  where exists (
    select 1 
      from [new_and_existing_bacteria_matches] i 
      where i.[existing_bacteria_name] = n.[existing_bacteria_name]
        and i.rn > 1
        );
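
As a lighter-weight check before building the helper table, the query below is a minimal sketch that lists join-key pairs occurring more than once in existing_bacteria. It assumes the join is on bacteria_size and bacteria_family, as in solution 3; the alias `e` and the column alias `candidate_count` are illustrative names only.

```sql
-- Key pairs that appear more than once in existing_bacteria.
-- Every row returned means that new_bacteria rows with this key pair
-- have several candidate names, so the UPDATE would pick one arbitrarily.
-- Note: rows with NULL keys are grouped here, but they never match in an "=" join.
SELECT e.[bacteria_size],
       e.[bacteria_family],
       COUNT(*) AS candidate_count
  FROM [dbo].[existing_bacteria] AS e
 GROUP BY e.[bacteria_size], e.[bacteria_family]
HAVING COUNT(*) > 1
 ORDER BY candidate_count DESC;
```

If this returns no rows (ignoring NULL keys), each new_bacteria row has at most one match and the helper table is not needed.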

Answer 1: (score: 2)

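On the subject of performance, I would look at:

  1. The database's recovery model. If your DBA says you can run in "simple" mode, do so; you want as little logging as possible.
  2. Consider disabling some indexes on the TARGET table and rebuilding them when you are done. In a large-scale operation, modifications to the indexes cause extra logging, and manipulating the indexes takes up space in the buffer pool.
  3. Could you convert NCHAR to CHAR? It requires less storage, which reduces IO, frees up buffer space, and reduces logging.
  4. If the target table has no clustered index, try activating trace flag 610 (warning: this is an instance-wide setting, so talk to your DBA first).
  5. If your environment allows it, the TABLOCKX hint can remove locking overhead and also helps meet the criteria for reduced logging.
  6. For anyone who has to perform bulk inserts or large-scale updates, Microsoft's white paper is a valuable read.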
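
Points 1, 2 and 5 above could be combined roughly as follows. This is only a sketch under assumptions: the index name `ix_new_bacteria_existing_name` is hypothetical (the sample tables in the question define no indexes), and the join columns are taken from solution 3 in the question.

```sql
-- 1. Check the recovery model of the current database.
SELECT [name], [recovery_model_desc]
  FROM sys.databases
 WHERE [name] = DB_NAME();

-- 2. Disable a nonclustered index on the target before the mass update.
--    (Hypothetical index name. Do not disable a clustered index:
--    that makes the table inaccessible.)
ALTER INDEX [ix_new_bacteria_existing_name] ON [dbo].[new_bacteria] DISABLE;

-- 5. Run the set-based update while holding an exclusive table lock.
UPDATE R
   SET R.[existing_bacteria_name] = P.[bacteria_name]
  FROM [dbo].[new_bacteria] AS R WITH (TABLOCKX)
 INNER JOIN [dbo].[existing_bacteria] AS P
         ON R.[bacteria_size]   = P.[bacteria_size]
        AND R.[bacteria_family] = P.[bacteria_family];

-- 2. Rebuild the index once the update has finished.
ALTER INDEX [ix_new_bacteria_existing_name] ON [dbo].[new_bacteria] REBUILD;
```

Only disable indexes that the update itself does not need; an index on the join columns should stay enabled.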

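    You could try a MERGE statement. It performs the operation in a single pass over the data. (The problem with MERGE is that it tries to do everything in one transaction, and you can end up with an unwanted spool in the execution plan. In that case I would move to a batch process, looping through perhaps 100,000 records at a time.)

    (It will need a few small changes to meet your column matching/update requirements.)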

    MERGE [dbo].[new_bacteria] T    --TARGET TABLE
    USING [dbo].[existing_bacteria] S --SOURCE TABLE
    ON 
        S.[bacteria_name] = T.[existing_bacteria_name] --FIELDS TO MATCH ON
        AND S.[bacteria_type] = T.[bacteria_type] 
    WHEN MATCHED 
        AND 
        ISNULL(T.[bacteria_sub_type],'') <> ISNULL(S.[bacteria_sub_type],'') --FIELDS WHERE YOURE LOOKING FOR A CHANGE
        OR ISNULL(T.[bacteria_size],'') <> ISNULL(S.[bacteria_size],'') 
    
        THEN    --UPDATE RECORDS THAT HAVE CHANGED
        UPDATE
        SET T.[bacteria_sub_type] = S.[bacteria_sub_type]
    WHEN NOT MATCHED BY TARGET THEN --ANY NEW RECORDS IN THE SOURCE TABLE WILL BE INSERTED
        INSERT( 
            [existing_bacteria_name],
            [bacteria_type],
            [bacteria_sub_type],
            [bacteria_size],
            [bacteria_family],
            [bacteria_discovery_year]
            )
        VALUES(
            s.[bacteria_name],
            s.[bacteria_type],
            s.[bacteria_sub_type],
            s.[bacteria_size],
            s.[bacteria_family],
            s.[bacteria_discovery_year]
            );
    

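    If a single MERGE is too much for your system, you can embed it in a loop that updates in large batches, using the approach below. You can adjust the batch size to suit your server's capabilities.

    It works by using a couple of staging tables, which ensure that if anything goes wrong (e.g. a server agent restart), the process can continue from where it left off. (If you have any questions, please ask.)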
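
The batch loop that follows refers to two staging tables, [existing_bacteria_changes] and [existing_bacteria_Batch], whose definitions the answer does not show. A minimal sketch, assuming both simply mirror the column layout of [existing_bacteria], could create them like this:

```sql
-- Assumption: both staging tables have the same columns as existing_bacteria.
-- SELECT ... INTO with a false predicate copies the column definitions
-- without copying any rows (indexes and constraints are not copied).
IF OBJECT_ID(N'dbo.existing_bacteria_changes', N'U') IS NULL
    SELECT *
      INTO [dbo].[existing_bacteria_changes]
      FROM [dbo].[existing_bacteria]
     WHERE 1 = 0;

IF OBJECT_ID(N'dbo.existing_bacteria_Batch', N'U') IS NULL
    SELECT *
      INTO [dbo].[existing_bacteria_Batch]
      FROM [dbo].[existing_bacteria]
     WHERE 1 = 0;
```

Keeping the batch table free of triggers and foreign keys matters here, because a table that receives rows through OUTPUT ... INTO may not have them.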

        --CAPTURE WHAT HAS CHANGED SINCE THE LAST TIME THE SP WAS RUN
        --EXCEPT is a useful command because it can compare NULLs; this removes the need for ISNULL or COALESCE
        INSERT INTO [dbo].[existing_bacteria_changes]
        SELECT
            *
        FROM
            [dbo].[existing_bacteria]
        EXCEPT
        SELECT
            *
        FROM
            [dbo].[new_bacteria]
    
    --RUN FROM THIS POINT IN THE EVENT OF A FAILURE
    
        DECLARE @R INT = 1
        DECLARE @Batch INT = 100000
    
        WHILE  @R > 0
            BEGIN
                BEGIN TRAN  --CARRY OUT A TRANSACTION WITH A SUBSET OF DATA
    
                --USE DELETE WITH OUTPUT TO MOVE A BATCH OF RECORDS INTO A HOLDING AREA.
                --The holding area will provide a rollback point so if the job fails at any point it will restart from where it last was.
                DELETE TOP (@Batch)
                FROM [dbo].[existing_bacteria_changes]
                OUTPUT DELETED.* INTO [dbo].[existing_bacteria_Batch];
    
                --LOG THE NUMBER OF RECORDS IN THE UPDATE SET; THE LOOP STOPS ONCE A BATCH MOVES NO ROWS
                SET @R = ISNULL(@@ROWCOUNT,0)
    
                --RUN THE MERGE STATEMENT WITH THE SUBSET OF UPDATES
                MERGE [dbo].[new_bacteria] T    --TARGET TABLE
                USING [dbo].[existing_bacteria_Batch] S --SOURCE TABLE
                ON 
                    S.[bacteria_name] = T.[existing_bacteria_name] --FIELDS TO MATCH ON
                    AND S.[bacteria_type] = T.[bacteria_type] 
                WHEN MATCHED 
                    AND 
                    ISNULL(T.[bacteria_sub_type],'') <> ISNULL(S.[bacteria_sub_type],'') --FIELDS WHERE YOURE LOOKING FOR A CHANGE
                    OR ISNULL(T.[bacteria_size],'') <> ISNULL(S.[bacteria_size],'') 
    
                    THEN    --UPDATE RECORDS THAT HAVE CHANGED
                    UPDATE
                    SET T.[bacteria_sub_type] = S.[bacteria_sub_type]
                WHEN NOT MATCHED BY TARGET THEN --ANY NEW RECORDS IN THE SOURCE TABLE WILL BE INSERTED
                    INSERT( 
                        [existing_bacteria_name],
                        [bacteria_type],
                        [bacteria_sub_type],
                        [bacteria_size],
                        [bacteria_family],
                        [bacteria_discovery_year]
                        )
                    VALUES(
                        s.[bacteria_name],
                        s.[bacteria_type],
                        s.[bacteria_sub_type],
                        s.[bacteria_size],
                        s.[bacteria_family],
                        s.[bacteria_discovery_year]
                        );
    
                COMMIT;
    
                --No point in logging this action
                TRUNCATE TABLE [dbo].[existing_bacteria_Batch]
    
            END
    

Answer 2: (score: 1)

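Definitely option 3. Set-based always wins over anything loop-based.

That said, the biggest 'risk' might be that the amount of updated data 'overwhelms' your machine. More specifically, the transaction could become so large that the system takes forever to finish it. To avoid this, you could try splitting the one big UPDATE into multiple smaller UPDATEs, each of which still works set-based. Good indexing and knowing your data are key here.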

For example, starting from

UPDATE R 
   SET R.existing_bacteria_name = P.[bacteria_name]
  FROM [new_bacteria] AS R
 INNER JOIN [existing_bacteria] P
         ON R.bacteria_size   = P.bacteria_size
        AND R.bacteria_family = P.bacteria_family

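you could try to 'chunk' the (target) table into smaller parts, e.g. by looping over the bacteria_discovery_year field, assuming that this column splits the table into, say, 50 more or less equally sized parts. (By the way: I'm no biologist, so I might be totally wrong on this =)

You would then get something along these lines: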

DECLARE @c_bacteria_discovery_year date

DECLARE year_loop CURSOR LOCAL STATIC
    FOR SELECT DISTINCT bacteria_discovery_year
          FROM [new_bacteria] 
         ORDER BY bacteria_discovery_year
OPEN year_loop 
FETCH NEXT FROM year_loop INTO @c_bacteria_discovery_year
WHILE @@FETCH_STATUS = 0
    BEGIN

        UPDATE R 
           SET R.existing_bacteria_name = P.[bacteria_name]
          FROM [new_bacteria] AS R
         INNER JOIN [existing_bacteria] P
                 ON R.bacteria_size   = P.bacteria_size
                AND R.bacteria_family = P.bacteria_family
         WHERE R.bacteria_discovery_year = @c_bacteria_discovery_year

        FETCH NEXT FROM year_loop INTO @c_bacteria_discovery_year

    END
CLOSE year_loop
DEALLOCATE year_loop

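Some remarks:

  • Like I said, I don't know the distribution of the bacteria_discovery_year values; if 3 years make up 95% of the data, it might not be such a great choice.
  • This only works well if there is an index on the bacteria_discovery_year column, preferably including bacteria_size and bacteria_family.
  • You could add some PRINT statements inside the loop to see the progress and the rows affected... it won't speed anything up, but it feels better if you know it's doing something =)
  • All in all, don't overdo it: if you split it into too many small chunks, you will end up with something that takes forever too.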
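
The first three remarks can be sketched in T-SQL as follows. The index name `ix_new_bacteria_discovery_year`, the variable `@rows`, and the stand-in date value are assumptions made for illustration; CONCAT requires SQL Server 2012 or later.

```sql
-- Remark 1: check how evenly the rows are spread over the chunking column.
SELECT [bacteria_discovery_year], COUNT(*) AS row_count
  FROM [dbo].[new_bacteria]
 GROUP BY [bacteria_discovery_year]
 ORDER BY row_count DESC;

-- Remark 2: index the chunking column and include the join columns
-- (hypothetical index name).
CREATE INDEX [ix_new_bacteria_discovery_year]
    ON [dbo].[new_bacteria] ([bacteria_discovery_year])
    INCLUDE ([bacteria_size], [bacteria_family]);

-- Remark 3: report progress for one chunk. Inside the cursor loop, the
-- variable below would be the value fetched from year_loop; here it is
-- a stand-in so the fragment runs on its own.
DECLARE @c_bacteria_discovery_year date = '20160101';
DECLARE @rows int;

UPDATE R
   SET R.existing_bacteria_name = P.[bacteria_name]
  FROM [dbo].[new_bacteria] AS R
 INNER JOIN [dbo].[existing_bacteria] AS P
         ON R.bacteria_size   = P.bacteria_size
        AND R.bacteria_family = P.bacteria_family
 WHERE R.bacteria_discovery_year = @c_bacteria_discovery_year;

-- Capture @@ROWCOUNT immediately; any later statement resets it.
SET @rows = @@ROWCOUNT;
PRINT CONCAT(CONVERT(char(10), @c_bacteria_discovery_year, 120),
             ': ', @rows, ' rows updated');
```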

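PS: In any case, you will also need an index on the 'source' table that covers the bacteria_size and bacteria_family columns, preferably including bacteria_name if the latter is not the (clustered) PK of the table.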