Question

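I have a table with about 17 million rows in it, and I need to de-duplicate the rows in that table. Under normal circumstances this wouldn't be a challenge, but this isn't a normal circumstance. Normally, "duplicate rows" are defined as two or more rows containing exactly the same values for all columns. In this case, "duplicate rows" are defined as two or more rows that have exactly the same values but are also within 20 seconds of each other. I wrote a script that is still running after 19.5 hours. That isn't acceptable, but I'm not sure how else to do it. Here's the script: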

begin
create table ##dupes (ID  int)
declare curOriginals cursor for 
select ID, AssociatedEntityID, AssociatedEntityType, [Timestamp] from tblTable

declare @ID    int
declare @AssocEntity int
declare @AssocType  int
declare @Timestamp  datetime
declare @Count   int

open curOriginals
fetch next from curOriginals into @ID, @AssocEntity, @AssocType, @Timestamp
while @@FETCH_STATUS = 0
begin
select @Count = COUNT(*) from tblTable where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType 
and [Timestamp] >= DATEADD(ss, -20, @Timestamp) 
and [Timestamp] <= DATEADD(ss, 20, @Timestamp) 
and ID <> @ID
if (@Count > 0)
begin
insert into ##dupes (ID) 
(select ID from tblHBMLog where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType 
and [Timestamp] >= DATEADD(ss, -20, @Timestamp) 
and [Timestamp] <= DATEADD(ss, 20, @Timestamp) 
and ID <> @ID)
print @ID
end
delete from tblHBMLog where ID = @ID or ID in (select ID from ##dupes)
fetch next from curOriginals into @ID, @AssocEntity, @AssocType, @Timestamp
end

close curOriginals
deallocate curOriginals

select * from ##dupes
drop table ##dupes
end

Any help would be greatly appreciated.

Answer 1

A quick tweak that should gain some speed is to replace the nasty COUNT section with an EXISTS check:

IF EXISTS(SELECT 1 FROM tblTable WHERE AssociatedEntityID = @AssocEntity
    AND AssociatedEntityType = @AssocType AND [Timestamp] >= DATEADD(ss, -20, @Timestamp)
    AND [Timestamp] <= DATEADD(ss, 20, @Timestamp)
    AND ID <> @ID) -- if there are any matching rows...
BEGIN
    -- delete the current row and its matches in one statement,
    -- recording the deleted IDs in ##dupes as we go
    DELETE FROM tblHBMLog
    OUTPUT deleted.ID INTO ##dupes
    WHERE AssociatedEntityID = @AssocEntity AND AssociatedEntityType = @AssocType 
        AND [Timestamp] >= DATEADD(ss, -20, @Timestamp) 
        AND [Timestamp] <= DATEADD(ss, 20, @Timestamp) -- I think this is supposed to be within the block, not outside it
END

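I've also replaced the double reference to ##dupes with the OUTPUT clause, which means you're no longer scanning an ever-growing ##dupes every time you delete a row. As for the deletion itself: since you're deleting the ID and its matches in one go, you don't need such an elaborate delete clause. You've already checked that there are entries that need deleting, and you seem to want to delete all of them, including the original.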

Once you answer Paul's question, we can look at removing the cursor completely.

Answer 2

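Basically, I agree with Bob. First, your code does far too much work per iteration to be repeated 17 million times. Second, you could crop your set down to the absolute duplicates first. Third, it would be nicer, if you have enough memory (which you should), to solve this in your programming language of choice.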

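Anyway, for the sake of a hardcoded answer, and because your query might still be running, I'll try to give a working script which I think (?) does what you want.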

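First of all, you should have an index. I would recommend an index on the AssociatedEntityID field. If you already have one, but the table has been populated with lots of data since the index was created, then drop it and recreate it in order to have fresh statistics.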

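Then see the script below, which does the following:

Dumps all duplicates into ##dupes, ignoring the 20 secs rule for now.
Sorts them (by AssociatedEntityID, Timestamp) and starts the simplest straight loop it can do.
Checks for a duplicate AssociatedEntityID and a timestamp inside the 20 sec interval. If both are true, it inserts the id into the ##dupes_to_be_deleted table.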

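I'm assuming that if there is a set of more than two duplicates in sequence, the script eliminates every duplicate within 20 secs of the first one. Then, from the next remaining row, if any, it resets and goes for another 20 secs, and so on...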

Here is the script. It may work for you, though I did not have the time to test it:

CREATE TABLE ##dupes
             (
                          ID                 INT ,
                          AssociatedEntityID INT ,
                          [Timestamp]        DATETIME
             )
CREATE TABLE ##dupes_to_be_deleted
             (
                          ID INT
             )

-- collect all dupes, ignoring for now the rule of 20 secs
INSERT
INTO   ##dupes
SELECT ID                 ,
       AssociatedEntityID ,
       [Timestamp]
FROM   tblTable
WHERE  AssociatedEntityID IN
       ( SELECT  AssociatedEntityID
       FROM     tblTable
       GROUP BY AssociatedEntityID
       HAVING   COUNT(*) > 1
       )

-- then sort and loop on all of them
-- using a cursor
DECLARE c CURSOR FOR
SELECT   ID                 ,
         AssociatedEntityID ,
         [Timestamp]
FROM     ##dupes
ORDER BY AssociatedEntityID,
         [Timestamp]

-- declarations
DECLARE @id                     INT,
        @AssociatedEntityID     INT,
        @ts                     DATETIME,
        @old_AssociatedEntityID INT,
        @old_ts                 DATETIME

-- initialisation
SELECT @old_AssociatedEntityID = 0,
       @old_ts                 = '1900-01-01'

-- start loop
OPEN c
FETCH NEXT
FROM  c
INTO  @id                ,
      @AssociatedEntityID,
      @ts
WHILE @@fetch_status = 0
BEGIN
        -- a row is a duplicate only if it has the same AssociatedEntityID
        -- as the previous row AND lies within 20 secs of the anchor row
        IF @AssociatedEntityID = @old_AssociatedEntityID
           AND @ts <= DATEADD(ss, 20, @old_ts )
        BEGIN
                -- yes! it is a duplicate
                -- store it in ##dupes_to_be_deleted
                INSERT
                INTO   ##dupes_to_be_deleted
                       (
                              id
                       )
                       VALUES
                       (
                              @id
                       )
        END
        ELSE
        BEGIN
                -- IS THIS OK?:
                -- this row is either the first row of a new AssociatedEntityID
                -- or more than 20 secs after the current anchor, so it is kept
                -- and its timestamp becomes the anchor for the next comparisons.
                -- (the anchor must also be reset when the AssociatedEntityID
                -- changes, otherwise the next row is compared with a stale
                -- timestamp from the previous entity.)
                -- this way we delete all duplicates
                -- 20 secs away from the first of the set of duplicates
                -- and the next one remaining will be a duplicate
                -- but after the 20 secs interval.
                -- and so on ...
                SET @old_ts = @ts
        END

        -- prepare vars for next iteration
        SELECT @old_AssociatedEntityID = @AssociatedEntityID
        FETCH NEXT
        FROM  c
        INTO  @id                ,
              @AssociatedEntityID,
              @ts
END
CLOSE c
DEALLOCATE c


-- now you have all the ids that are duplicates and in the 20 sec interval of the first duplicate in the ##dupes_to_be_deleted
DELETE
FROM       <wherever> -- replace <wherever> with tblHBMLog?
WHERE  id IN
       ( SELECT id
       FROM    ##dupes_to_be_deleted
       )
DROP TABLE ##dupes_to_be_deleted
DROP TABLE ##dupes

You may give it a try and leave it for a couple of hours. Hope it helps.
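
Because the script was not tested, a small in-memory model of the pass described above may help to check the intended behaviour on a handful of rows. This is only a sketch in Python, not part of the original answer: the function name `ids_to_delete`, the `(id, entity_id, timestamp)` tuple layout and the sample rows are made up for illustration, and it keys on AssociatedEntityID alone, as the description does.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=20)

def ids_to_delete(rows):
    """rows: iterable of (id, entity_id, timestamp) tuples (hypothetical layout).

    Returns the ids that lie within 20 seconds of the first surviving row
    (the "anchor") of their entity. The anchor itself is kept.
    """
    doomed = []
    prev_entity = None
    anchor_ts = None
    # Same ordering as the cursor: by entity, then by timestamp.
    for row_id, entity, ts in sorted(rows, key=lambda r: (r[1], r[2])):
        if entity == prev_entity and ts <= anchor_ts + WINDOW:
            doomed.append(row_id)
        else:
            # New entity, or outside the window: this row becomes the anchor.
            anchor_ts = ts
        prev_entity = entity
    return doomed

base = datetime(2009, 1, 1, 12, 0, 0)
sample = [
    (1, 100, base),
    (2, 100, base + timedelta(seconds=10)),  # within 20 s of row 1
    (3, 100, base + timedelta(seconds=25)),  # outside the window: new anchor
    (4, 100, base + timedelta(seconds=40)),  # within 20 s of row 3
    (5, 200, base + timedelta(seconds=5)),   # different entity
]
print(ids_to_delete(sample))  # [2, 4]
```

Rows 1, 3 and 5 survive because each one starts a new 20 second window.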

Answer 3

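If you have enough memory and storage, it may be faster to do it this way:

1. Create a new table with a similar structure.
2. Copy all the data into this temp table using a SELECT with DISTINCT.
3. Clear the original table (you will need to drop some constraints before this).
4. Copy the data back into the original table.

Instead of steps 3 and 4, you can drop the original table and rename the temp table.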

Answer 4

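Putting the time differential aside, the very first thing I would do is knock this list down to a much smaller subset of potential duplicates. For example, if you have 17 million rows but only, say, 10 million of them match another row on every field except the time, then you have just chopped a large portion of your processing off.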

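To do this, I'd just whip up a query that dumps the unique IDs of the potential duplicates into a temp table, and then use that as an inner join on your cursor (again, this would be a first step).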

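Looking at the cursor, I see a lot of relatively heavy function calls, which would explain your slowdowns. There's also a lot of data activity, and I wouldn't be surprised if you were being crushed by an I/O bottleneck.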

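One thing you could do then is, rather than use the cursor, dump the data into your programming language of choice. Assuming we've already limited all of our fields except the timestamp to a manageable set, grab each subset in turn (i.e. the rows that match on the remaining fields), since any duplicates must necessarily have all of their other fields matching. Then just snuff out the duplicates you find in these smaller atomic subsets.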

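So assuming you have 10 million potentials, and each time range has about 20 records or so that need to be worked through with the date logic, you're down to a much smaller number of database calls plus some quick code. From experience, doing the datetime comparisons and the like outside of SQL is generally a lot faster.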

The bottom line is to figure out ways to partition your data into manageable subsets as quickly as possible.
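
A minimal sketch of that partitioning step, with Python standing in for "your programming language of choice". The function name `candidate_groups` and the `(id, entity_id, entity_type, timestamp)` row layout are assumptions made for illustration. It groups rows by the non-time fields and keeps only the groups with more than one row, since a row whose key is unique can never be a duplicate.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def candidate_groups(rows):
    """rows: iterable of (id, entity_id, entity_type, timestamp) tuples
    (hypothetical layout).

    Returns {(entity_id, entity_type): [(timestamp, id), ...]} containing
    only the keys that occur more than once, each list sorted by timestamp.
    """
    groups = defaultdict(list)
    for row_id, entity_id, entity_type, ts in rows:
        groups[(entity_id, entity_type)].append((ts, row_id))
    return {key: sorted(members)
            for key, members in groups.items()
            if len(members) > 1}

base = datetime(2009, 1, 1)
rows = [
    (1, 100, 1, base),
    (2, 100, 1, base + timedelta(seconds=5)),
    (3, 100, 2, base),   # same entity, different type: its own key
    (4, 200, 1, base),   # unique key: cannot be a duplicate
]
groups = candidate_groups(rows)
print(sorted(groups))  # [(100, 1)]
```

Each surviving group is small and already sorted by timestamp, so the 20 second check only has to walk neighbouring rows within one group.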

Hope that helps!

-Bob

Answer 5

In answer to Paul's question:

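"What happens when you have three entries, a, b, c, with a = 00 secs, b = 19 secs, c = 39 secs? Are these all considered to be the same time? (a is within 20 secs of b, b is within 20 secs of c)"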

If the other comparisons are equal (AssociatedEntityID and AssociatedEntityType), then yes, they are considered the same thing; otherwise no.
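
Under that reading, closeness chains: a, b and c end up in one group even though a and c are 39 seconds apart. A minimal Python sketch of that chained grouping for the rows of a single (AssociatedEntityID, AssociatedEntityType) pair follows. The function name `chain_clusters` and the use of plain second offsets instead of datetime values are simplifications for illustration only.

```python
def chain_clusters(timestamps, window=20):
    """Group timestamps (in seconds) so that each one lies within
    `window` seconds of the previous one in the same group.
    """
    clusters = []
    for ts in sorted(timestamps):
        if clusters and ts - clusters[-1][-1] <= window:
            # Close enough to the last member: extend the current chain.
            clusters[-1].append(ts)
        else:
            # Gap larger than the window: start a new chain.
            clusters.append([ts])
    return clusters

print(chain_clusters([0, 19, 39]))  # [[0, 19, 39]]  (a, b, c)
print(chain_clusters([0, 19, 40]))  # [[0, 19], [40]]
```

Every row in a chain is then treated as the same event, so a chain can grow well beyond 20 seconds as long as no gap between neighbours exceeds the window.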

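I would add this to the original question, except that I used a different account to post it and now can't remember my password. It was a very old account, and I didn't realize I had connected to the site with it.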

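I have been working with some of the answers you all have given me, and there is one problem: you're using only one key column (AssociatedEntityID) when there are two (AssociatedEntityID and AssociatedEntityType). Your suggestions would work great for a single key column.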

What I have done so far is:

Step 1: Determine which AssociatedEntityID and AssociatedEntityType pairs have duplicates and insert them into a temp table:

create table ##stage1 (ID   int, AssociatedEntityID     int, AssociatedEntityType   int, [Timestamp]    datetime)

insert into ##stage1 (AssociatedEntityID, AssociatedEntityType)
    (select AssociatedEntityID, AssociatedEntityType from tblHBMLog group by AssociatedEntityID, AssociatedEntityType having COUNT(*) > 1)

Step 2: Retrieve the ID of the earliest occurring row for a given AssociatedEntityID and AssociatedEntityType pair:

declare curStage1 cursor for 
    select AssociatedEntityID, AssociatedEntityType from ##stage1

open curStage1  
fetch next from curStage1 into @AssocEntity, @AssocType
while @@FETCH_STATUS = 0
begin
    select top 1 @ID = ID, @Timestamp = [Timestamp] from tblHBMLog where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType order by [Timestamp] asc
    update ##stage1 set ID = @ID, [Timestamp] = @Timestamp where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType
    -- advance the cursor; without this fetch the loop never terminates
    fetch next from curStage1 into @AssocEntity, @AssocType
end

close curStage1
deallocate curStage1

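And this is where things slow down again. Granted, the result set has been pared down from about 17 million to just under 400,000, but it still takes quite a long time to run through.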

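I guess another question I should ask is this: if I continue to write this in SQL, will it just have to take a long time? Should I write this in C# instead? Or am I just being stupid and not seeing the forest for the trees?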

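Well, after much stomping of feet and gnashing of teeth, I have come up with a solution. It's just a simple, quick and dirty C# command line app, but it's faster than the SQL script and it gets the job done.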

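I thank you all for your help. In the end the SQL script was just taking too much time to execute, and C# is much better suited for looping.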

De-duping rows in a SQL Server 2005 table