A simpler, more efficient way to remove duplicate rows from an SQL Server table

Date: 2016-02-12 21:59:20

Tags: sql sql-server tsql sql-server-2008-r2 duplicates

Overview

I have a table (EODBalances) in SQL Server 2008 R2 with a significant # of rows (~200 million). It belongs to an accounting system (general ledger), and its role is to store the closing balance for every account in the system.

The table definition

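The definition did not come through intact here; as a stand-in, a minimal sketch inferred from the columns the answers below rely on (AccountId, Balance, Created; the real table almost certainly has more columns):

create table dbo.EODBalances (
    AccountId int      not null,   -- account identifier (inferred from the queries below)
    Balance   money    not null,   -- closing balance for the day (inferred)
    Created   datetime not null,   -- date the balance row was written (inferred)
    constraint PK_EODBalances primary key clustered (AccountId, Created)
);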

The task

The # of accounts is growing exponentially, which causes the # of rows in EODBalances to grow just as fast. On top of that growth, an existing issue is that we add a new row for every account every day, even when the account balance has not changed. My task is to reduce the # of rows in this table by removing duplicate rows for each account. I have refactored the stored proc that updates the table every night so that it only adds a new row when the balance has changed. This, of course, only helps going forward.
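For reference, that going-forward fix amounts to inserting a row only when tonight's balance differs from the account's most recent stored one. A minimal sketch, assuming the hypothetical columns above and a staging table #TonightBalances holding the freshly computed balances (both names are assumptions, not the actual proc):

insert into dbo.EODBalances (AccountId, Balance, Created)
select n.AccountId, n.Balance, getdate()
from #TonightBalances n                    -- tonight's computed balances (assumed staging table)
outer apply (select top (1) b.Balance
             from dbo.EODBalances b
             where b.AccountId = n.AccountId
             order by b.Created desc) last
where last.Balance is null                 -- account has no rows yet: always insert
   or last.Balance <> n.Balance;           -- balance changed since the last stored row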

The problem

The part I am having trouble with is the cleanup of the historical rows in the table, which is a particular flavor of duplicate removal. I need to keep the original (first) entry of any account balance but remove the subsequent rows where the closing balance does not change. As soon as the balance changes I need to keep that specific row, and then again remove the subsequent rows until it changes once more. And so on...
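To make the rule concrete, one account's history might look like this (illustrative numbers); note that it is consecutive repeats that get removed, so the return to 100.00 on the last day is kept:

Created      Balance   Action
2016-01-01   100.00    keep (first entry)
2016-01-02   100.00    remove (unchanged)
2016-01-03   100.00    remove (unchanged)
2016-01-04   250.00    keep (balance changed)
2016-01-05   250.00    remove (unchanged)
2016-01-06   100.00    keep (changed again)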

I have tried a few different ways to achieve this, but all of them are very inefficient, and on top of the time they take to run, they have side effects like massive log files (a pain when the database is log shipped). The current solution is to create a copy of the table, copy the rows I want to keep into it, then drop the original table and rename the copy back to the original name. This works, but it takes more hours than I have in the available upgrade window.

Has anyone had a similar issue and found a better way to deal with it?

2 Answers:

Answer 0 (score: 2)

Here is an outline of the process I have used for similar situations:

  • Design the algorithm that identifies the duplicate rows to delete. Use group by, min(), max(), row_number(), whatever; there are several ways to do this, they have been posted many times, and it sounds like you already have this part.

  • As you know, that adds up to one big pile of work.

  • Split the work into chunks and process one chunk at a time. Spread the work out over time to keep your transaction log size under control. If (say) you take t-log backups every hour, run the process only a few times per hour, so the transaction log stays small and the t-log backup files don't get out of hand.

  • How to split it? Based on your data, I would say by AccountId. Process some number of accounts (1, 10, 100, 1000?) per batch, whatever size is reasonable for your conditions (see transaction log bloat, above).

  • How to manage it all? Create a "purge log" table. Populate it with every AccountId that needs to be checked (meaning you never have to add new accounts to it). Loop in some fashion, running the delete routine once per account, or per 10 accounts, or whatever. Once purged, flag the account as "processed" in the purge log table so it is not processed again. Record the number of rows deleted and when the work ran, so you can track progress; a sketch of all this follows the list.

  • The last step is scheduling. Wrap it all in a stored procedure, and configure a SQL Agent job to call the procedure every X (your t-log backup cycle). Schedule it to run during whatever window is viable: all day long if it is "non-intrusive" enough, or during the early hours of Sunday morning if the system is quiet enough then. (I had one that ran for 16 hours over a weekend, between the "last" differential backup and the weekly full backup.)

  • Let it run until the work is done. If the work has to be finished as soon as possible, you may have to pay the price in log size, performance during working hours, and whatever else.
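A hedged sketch of the purge-log approach, pulled together into one procedure (the table and procedure names, the batch size, and the use of the other answer's gaps-and-islands logic for the actual delete are all assumptions):

-- one-time setup: the driving "purge log" table
create table dbo.PurgeLog (
    AccountId   int      not null primary key,
    Processed   bit      not null default 0,
    ProcessedAt datetime null
);

insert into dbo.PurgeLog (AccountId)
select distinct AccountId from dbo.EODBalances;
go

-- called repeatedly by a SQL Agent job; each call purges only a small
-- batch of accounts, so each t-log backup cycle sees bounded log growth
create procedure dbo.PurgeDuplicateBalances
    @BatchSize int = 100    -- accounts per call; tune to your log window
as
begin
    set nocount on;

    declare @Accounts table (AccountId int primary key);

    -- grab the next batch of unprocessed accounts
    insert into @Accounts (AccountId)
    select top (@BatchSize) AccountId
    from dbo.PurgeLog
    where Processed = 0
    order by AccountId;

    -- delete consecutive duplicates for just this batch of accounts,
    -- using the gaps-and-islands numbering from the other answer
    with grouped as (
        select b.AccountId, b.Balance, b.Created,
               row_number() over (partition by b.AccountId order by b.Created)
             - row_number() over (partition by b.AccountId, b.Balance order by b.Created) as grp
        from dbo.EODBalances b
        where b.AccountId in (select AccountId from @Accounts)
    ), numbered as (
        select g.*,
               row_number() over (partition by g.AccountId, g.Balance, g.grp
                                  order by g.Created) as seqnum
        from grouped g
    )
    delete from numbered
    where seqnum > 1;       -- keep the first row of every run of equal balances
    -- @@rowcount here is the number of rows removed for the batch;
    -- log it somewhere if you want to track progress

    -- flag the batch as done so it is never touched again
    update pl
    set Processed = 1, ProcessedAt = getdate()
    from dbo.PurgeLog pl
    where pl.AccountId in (select AccountId from @Accounts);
end;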

Answer 1 (score: 2)

I would create a new table for this and then reload the data. Identifying the rows is not that hard; what you need to identify is the groups. It goes something like this:

select e.*,
       -- number the rows within each (AccountId, balance, grp) island;
       -- seqnum = 1 marks the first row of each run of identical balances
       row_number() over (partition by AccountId, balance, grp order by created) as seqnum
from (select e.*,
             -- gaps-and-islands trick: the difference between the two row
             -- numbers is constant across consecutive rows with the same balance
             (row_number() over (partition by AccountId order by created) -
              row_number() over (partition by AccountId, balance order by created)
             ) as grp
      from EODBalances e
     ) e;

The rows with seqnum = 1 are the first occurrence in each run; those are the ones to keep.
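To see why the subtraction works, number the illustrative history from the question by hand: rn_all is row_number() over the account alone, rn_bal is the one over account plus balance, and grp is their difference:

Created      Balance   rn_all   rn_bal   grp
2016-01-01   100.00    1        1        0
2016-01-02   100.00    2        2        0
2016-01-03   100.00    3        3        0
2016-01-04   250.00    4        1        3
2016-01-05   250.00    5        2        3
2016-01-06   100.00    6        4        2

Within each run of consecutive identical balances grp is constant, and each (Balance, grp) pair is a distinct island, so seqnum = 1 lands exactly on the rows to keep: 01-01, 01-04, and 01-06.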

Then I would do something like this:

select *
into temp_EODBalances
from (select e.*,
             row_number() over (partition by AccountId, balance, grp order by created) as seqnum
      from (select e.*,
                   (row_number() over (partition by AccountId order by created) - 
                    row_number() over (partition by AccountId, balance order by created)
                   ) as grp
            from EODBalances e
           ) e
      ) e
where seqnum = 1;
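One thing to watch: select * into copies the helper columns grp and seqnum into temp_EODBalances as well, so you would either list the real columns explicitly instead of *, or drop the two extra columns before the reload.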

Then I would test the hell out of the table. Finally, when satisfied (and after backing up the original table), I would do this:

truncate table EODBalances;

insert into EODBalances(. . . )
    select . . .
    from temp_EODBalances;
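With the hypothetical column names from the sketch near the top (and assuming the helper columns grp and seqnum were excluded or dropped, as noted above), the reload would look something like:

insert into EODBalances (AccountId, Balance, Created)
select AccountId, Balance, Created
from temp_EODBalances;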