Query to find duplicate rows in a table

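Time: 2016-09-14 14:13:51

Tags: tsql

I'm running the following query, which is very inefficient and can take hours. I'm having an SQL brain fart today and can't see how to improve it. There are several nullable varchar fields, and I need to identify duplicate rows (rows in which every column contains the same value as in another row).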

select * from transactions x where exists (
  select Coalesce(ColA, ''),
         Coalesce(ColB, ''),
         Coalesce(ColC, '')
  from transactions y
  where Coalesce(x.ColA, '') = Coalesce(x.ColA, '') and
        Coalesce(x.ColB, '') = Coalesce(x.ColB, '') and
        Coalesce(x.ColC, '') = Coalesce(x.ColC, '')
  group by Coalesce(ColA, ''),
           Coalesce(ColB, ''),
           Coalesce(ColC, '')
  having count(*) > 1
)

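Why does this take so long to run? There must be a better way.

4 answers:

Answer 0 (score: 3)

You can improve it by:

  1. Removing unnecessary checks
  2. Adding a composite index on ColA, ColB, ColC

What is unnecessary? There seems to be no need to join the table to itself. Why not use a simple GROUP BY? You don't need the WHERE either.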

    SELECT COALESCE(ColA, '') AS ColA, 
           COALESCE(ColB, '') AS ColB,
           COALESCE(ColC, '') AS ColC,
           Count(*) As Cnt
    FROM transactions t
    GROUP BY COALESCE(ColA, ''), COALESCE(ColB, ''), COALESCE(ColC, '')
    HAVING Count(*) > 1
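
The composite index from point 2 could be sketched as below. The index name is made up, and it assumes the three columns are narrow enough to serve as index keys (no varchar(max)). Because GROUP BY already puts NULLs into the same group, grouping on the bare columns is an option that may let the optimizer read the index in order. Note that this variant keeps NULL and '' as separate groups, unlike the COALESCE version.

```sql
-- Hypothetical index name; the column list comes from the question.
CREATE NONCLUSTERED INDEX IX_transactions_ColA_ColB_ColC
    ON transactions (ColA, ColB, ColC);

-- GROUP BY treats NULLs as belonging to one group, so no COALESCE is needed.
-- Unlike COALESCE(col, ''), NULL and '' stay in different groups here.
SELECT ColA, ColB, ColC, COUNT(*) AS Cnt
FROM transactions
GROUP BY ColA, ColB, ColC
HAVING COUNT(*) > 1;
```

Whether the index actually gets used depends on the table and the plan, so check the execution plan before relying on it.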
    

Answer 1 (score: 2)

Does this help?

DECLARE @transactions TABLE (
      ColA      INT
    , ColB      INT
    , ColC      INT
    , ColD      INT
    , ColE      INT
    , ColF      INT
    )

DECLARE @Counter1       INT = 0

WHILE @Counter1 < 10000
    BEGIN
        SET @Counter1 += 1
        INSERT INTO @transactions
            SELECT    ROUND(RAND()*10,0)
                    , ROUND(RAND()*10,0)
                    , ROUND(RAND()*10,0)
                    , ROUND(RAND()*10,0)
                    , ROUND(RAND()*10,0)
                    , ROUND(RAND()*10,0)
    END

;WITH Dupe
    AS (
        SELECT *, ROW_NUMBER() OVER
                (PARTITION BY ColA, ColB, ColC, ColD, ColE, ColF
                ORDER BY ColA, ColB, ColC, ColD, ColE, ColF) AS rn
            FROM @transactions
        )

SELECT * FROM Dupe WHERE rn > 1

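You can use ISNULL anywhere you need to compare values that might be NULL. Note that most of what I wrote is only there to generate a useful data set. With 6 columns and 10,000 rows, I got 42 identical rows in under a second, with no triples. Bumping it up to 100,000 rows, I got 3,489 duplicate rows, including some triples. That took 3 seconds.

Here is an example using text. The whole process took 25 seconds on 100,000 records, although my timers show that fewer than 4 of those seconds were spent finding duplicates; the rest went into populating the table.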
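
As a quick illustration of why the ISNULL wrapper matters when comparing values, here is a minimal sketch. It assumes the default ANSI_NULLS ON setting, and the variables @a and @b are made up for the example.

```sql
DECLARE @a NVARCHAR(30) = NULL,
        @b NVARCHAR(30) = NULL;

SELECT
    -- NULL = NULL evaluates to UNKNOWN, so the ELSE branch is taken
    CASE WHEN @a = @b
         THEN 'equal' ELSE 'not equal' END AS PlainComparison,
    -- both sides become '', so the comparison succeeds
    CASE WHEN ISNULL(@a, '') = ISNULL(@b, '')
         THEN 'equal' ELSE 'not equal' END AS IsNullComparison;
```

The trade-off is that a genuine empty string and a NULL are then treated as the same value.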

    DECLARE @transactions2 TABLE (
      ColA      NVARCHAR(30)
    , ColB      NVARCHAR(30)
    , ColC      NVARCHAR(30)
    , ColD      NVARCHAR(30)
    , ColE      NVARCHAR(30)
    , ColF      NVARCHAR(30)
    )

    DECLARE @names TABLE (
      ID        INT IDENTITY
    , Name      NVARCHAR(30)
    )

DECLARE   @Counter2     INT = 0
        , @ColA         NVARCHAR(30)
        , @ColB         NVARCHAR(30)
        , @ColC         NVARCHAR(30)
        , @ColD         NVARCHAR(30)
        , @ColE         NVARCHAR(30)
        , @ColF         NVARCHAR(30)

INSERT INTO @names VALUES
      ('Anderson, Arthur')
    , ('Broberg, Bruce')
    , ('Chan, Charles')
    , ('Davidson, Darwin')
    , ('Eggert, Emily')
    , ('Fox, Francesca')
    , ('Garbo, Greta')
    , ('Hollande, Hortense')
    , ('Iguadolla, Ignacio')
    , ('Jackson, Jurimbo')
    , ('Katana, Ken')
    , ('Lawrence, Larry')
    , ('McDonald, Michael')
    , ('Nyugen, Nathan')
    , ('O''Dell, Oliver')
    , ('Peterson, Phillip')
    , ('Quigley, Quentin')
    , ('Ramallah, Rodolfo')
    , ('Smith, Samuel')
    , ('Turner, Theodore')
    , ('Uno, Umberto')
    , ('Victor, Victoria')
    , ('Wallace, William')
    , ('Xing, Xiopan')
    , ('Young, Yvette')
    , ('Zapata, Zorro')
    , (NULL)

WHILE @Counter2 < 100000
    BEGIN
        SET @Counter2 += 1
        SET @ColA = (SELECT Name FROM @names WHERE ID = ROUND(RAND()*27 +.5,0))
        SET @ColB = (SELECT Name FROM @names WHERE ID = ROUND(RAND()*27 +.5,0))
        SET @ColC = (SELECT Name FROM @names WHERE ID = ROUND(RAND()*27 +.5,0))
        SET @ColD = (SELECT Name FROM @names WHERE ID = ROUND(RAND()*27 +.5,0))
        SET @ColE = (SELECT Name FROM @names WHERE ID = ROUND(RAND()*27 +.5,0))
        SET @ColF = (SELECT Name FROM @names WHERE ID = ROUND(RAND()*27 +.5,0))

        INSERT INTO @transactions2
            SELECT @ColA, @ColB, @ColC, @ColD, @ColE, @ColF
    END
PRINT CAST(GETDATE() AS DateTime2 (3))
;WITH Dupe
    AS (
        SELECT *, ROW_NUMBER() OVER
                (PARTITION BY ISNULL(ColA,''), ISNULL(ColB,''), ISNULL(ColC,''), ISNULL(ColD,''), ISNULL(ColE,''), ISNULL(ColF,'')
                ORDER BY ISNULL(ColA,''), ISNULL(ColB,''), ISNULL(ColC,''), ISNULL(ColD,''), ISNULL(ColE,''), ISNULL(ColF,'')) AS rn
            FROM @transactions2
        )

SELECT * FROM Dupe WHERE rn > 1 ORDER BY rn
PRINT CAST(GETDATE() AS DateTime2 (3))

Answer 2 (score: 0)

Joining to a subquery is a faster approach. It runs in 10 seconds.

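select x.*
from transactions x
join (
  -- alias the expressions so the join can refer to them by name
  select Coalesce(ColA, '') as ColA,
         Coalesce(ColB, '') as ColB,
         Coalesce(ColC, '') as ColC
  from transactions
  group by Coalesce(ColA, ''),
           Coalesce(ColB, ''),
           Coalesce(ColC, '')
  having count(*) > 1
) dups on
-- apply the same Coalesce to x so rows containing NULLs still match
dups.ColA = Coalesce(x.ColA, '') and
dups.ColB = Coalesce(x.ColB, '') and
dups.ColC = Coalesce(x.ColC, '')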

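An important point about this query is that it returns both (or all) rows of each duplicate set, not only the extra copies.

Answer 3 (score: -1)

If this is a one-off job involving a large number of rows, rather than something meant to live in a view, you could INSERT ... SELECT the data into a table that has a UNIQUE index with the IGNORE_DUP_KEY option.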
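
If the goal is every member of each duplicate set, a windowed count is one possible alternative to the join. This is only a sketch, assuming the `transactions` table and the three columns from the question; `Counted` and `DupCount` are names made up for the example. PARTITION BY places NULLs in the same partition, so no Coalesce is needed, although NULL and '' then count as different values.

```sql
;WITH Counted AS (
    SELECT *,
           -- number of rows sharing this ColA/ColB/ColC combination
           COUNT(*) OVER (PARTITION BY ColA, ColB, ColC) AS DupCount
    FROM transactions
)
SELECT *
FROM Counted
WHERE DupCount > 1;   -- keeps every row that has at least one twin
```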
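
A rough sketch of that idea follows. The target table name, the index name and the VARCHAR(50) column types are all assumptions, since the question does not show the real schema. A unique index in SQL Server treats NULLs as equal to one another, so rows that differ only by being "NULL in the same places" are also collapsed.

```sql
-- Hypothetical target table; adjust the column types to match transactions.
CREATE TABLE transactions_dedup (
      ColA VARCHAR(50) NULL
    , ColB VARCHAR(50) NULL
    , ColC VARCHAR(50) NULL
);

-- Duplicate keys are discarded with a warning instead of failing the insert.
CREATE UNIQUE CLUSTERED INDEX UX_transactions_dedup
    ON transactions_dedup (ColA, ColB, ColC)
    WITH (IGNORE_DUP_KEY = ON);

INSERT INTO transactions_dedup (ColA, ColB, ColC)
SELECT ColA, ColB, ColC
FROM transactions;
```

Keep in mind that this leaves you with the distinct rows rather than a list of the duplicates, so it suits cleanup more than reporting.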