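I'm running the following query, which is very inefficient and can take hours. I'm having a SQL brain fart today and can't work out how to improve it. There are several nullable varchar columns, and I need to identify duplicate rows (rows in which every column contains the same value as another row).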
select * from transactions x where exists (
select Coalesce(ColA, ''),
Coalesce(ColB, ''),
Coalesce(ColC, '')
from transactions y
where Coalesce(x.ColA, '') = Coalesce(x.ColA, '') and
Coalesce(x.ColB, '') = Coalesce(x.ColB, '') and
Coalesce(x.ColC, '') = Coalesce(x.ColC, '')
group by Coalesce(ColA, ''),
Coalesce(ColB, ''),
Coalesce(ColC, '')
having count(*) > 1
)
Why does this take so long to run? There has to be a better way.
Answer 0 (score: 3)
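You can improve the query over ColA, ColB and ColC by dropping what is unnecessary. There seems to be no need to join the table to itself. Why not use a simple GROUP BY? You don't need the WHERE either: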
SELECT COALESCE(ColA, '') AS ColA,
COALESCE(ColB, '') AS ColB,
COALESCE(ColC, '') AS ColC,
Count(*) As Cnt
FROM transactions t
GROUP BY COALESCE(ColA, ''), COALESCE(ColB, ''), COALESCE(ColC, '')
HAVING Count(*) > 1
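To make the role of COALESCE concrete, here is a small sketch on made-up data. The table variable @demo and its rows are hypothetical and are not part of the original question. It relies on the fact that GROUP BY already puts NULLs into a single group, so COALESCE only matters if NULL and the empty string should count as the same value.

```sql
-- Hypothetical sample data; @demo is a made-up stand-in for the transactions table.
DECLARE @demo TABLE (
    ColA VARCHAR(10) NULL,
    ColB VARCHAR(10) NULL,
    ColC VARCHAR(10) NULL
);

INSERT INTO @demo (ColA, ColB, ColC) VALUES
    ('a', NULL, 'c'),
    ('a', NULL, 'c'),
    ('a', '',   'c'),
    ('x', 'y',  'z');

-- Plain GROUP BY: the two NULL rows form one group, the '' row stays separate.
SELECT ColA, ColB, ColC, COUNT(*) AS Cnt
FROM @demo
GROUP BY ColA, ColB, ColC
HAVING COUNT(*) > 1;

-- With COALESCE: NULL and '' are folded together, so all three 'a' rows form one group.
SELECT COALESCE(ColA, '') AS ColA,
       COALESCE(ColB, '') AS ColB,
       COALESCE(ColC, '') AS ColC,
       COUNT(*) AS Cnt
FROM @demo
GROUP BY COALESCE(ColA, ''), COALESCE(ColB, ''), COALESCE(ColC, '')
HAVING COUNT(*) > 1;
```

If NULL and '' do not need to be treated as equal in your data, the plain GROUP BY form may be enough, and it avoids wrapping the columns in a function.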
Answer 1 (score: 2)
Does this help?
DECLARE @transactions TABLE (
ColA INT
, ColB INT
, ColC INT
, ColD INT
, ColE INT
, ColF INT
)
DECLARE @Counter1 INT = 0
WHILE @Counter1 < 10000
BEGIN
SET @Counter1 += 1
INSERT INTO @transactions
SELECT ROUND(RAND()*10,0)
, ROUND(RAND()*10,0)
, ROUND(RAND()*10,0)
, ROUND(RAND()*10,0)
, ROUND(RAND()*10,0)
, ROUND(RAND()*10,0)
END
;WITH Dupe
AS (
SELECT *, ROW_NUMBER() OVER
(PARTITION BY ColA, ColB, ColC, ColD, ColE, ColF
ORDER BY ColA, ColB, ColC, ColD, ColE, ColF) AS rn
FROM @transactions
)
SELECT * FROM Dupe WHERE rn > 1
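You can use ISNULL anywhere you need to compare values that might be null. Note that most of what I wrote is only there to generate a useful data set. With 6 columns and 10,000 rows, I got 42 identical rows in under a second, with no triples. Bumping it up to 100,000 rows, I got 3,489 duplicate rows, including some triples. That took 3 seconds.
Here is an example using text. The whole thing took 25 seconds on 100,000 records, although my timer shows that fewer than 4 of those seconds were spent finding the duplicates; the rest was populating the table.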
DECLARE @transactions2 TABLE (
ColA NVARCHAR(30)
, ColB NVARCHAR(30)
, ColC NVARCHAR(30)
, ColD NVARCHAR(30)
, ColE NVARCHAR(30)
, ColF NVARCHAR(30)
)
DECLARE @names TABLE (
ID INT IDENTITY
, Name NVARCHAR(30)
)
DECLARE @Counter2 INT = 0
, @ColA NVARCHAR(30)
, @ColB NVARCHAR(30)
, @ColC NVARCHAR(30)
, @ColD NVARCHAR(30)
, @ColE NVARCHAR(30)
, @ColF NVARCHAR(30)
INSERT INTO @names VALUES
('Anderson, Arthur')
, ('Broberg, Bruce')
, ('Chan, Charles')
, ('Davidson, Darwin')
, ('Eggert, Emily')
, ('Fox, Francesca')
, ('Garbo, Greta')
, ('Hollande, Hortense')
, ('Iguadolla, Ignacio')
, ('Jackson, Jurimbo')
, ('Katana, Ken')
, ('Lawrence, Larry')
, ('McDonald, Michael')
, ('Nyugen, Nathan')
, ('O''Dell, Oliver')
, ('Peterson, Phillip')
, ('Quigley, Quentin')
, ('Ramallah, Rodolfo')
, ('Smith, Samuel')
, ('Turner, Theodore')
, ('Uno, Umberto')
, ('Victor, Victoria')
, ('Wallace, William')
, ('Xing, Xiopan')
, ('Young, Yvette')
, ('Zapata, Zorro')
, (NULL)
WHILE @Counter2 < 100000
BEGIN
SET @Counter2 += 1
SET @ColA = (SELECT Name FROM @names WHERE ID = ROUND(RAND()*27 +.5,0))
SET @ColB = (SELECT Name FROM @names WHERE ID = ROUND(RAND()*27 +.5,0))
SET @ColC = (SELECT Name FROM @names WHERE ID = ROUND(RAND()*27 +.5,0))
SET @ColD = (SELECT Name FROM @names WHERE ID = ROUND(RAND()*27 +.5,0))
SET @ColE = (SELECT Name FROM @names WHERE ID = ROUND(RAND()*27 +.5,0))
SET @ColF = (SELECT Name FROM @names WHERE ID = ROUND(RAND()*27 +.5,0))
INSERT INTO @transactions2
SELECT @ColA, @ColB, @ColC, @ColD, @ColE, @ColF
END
PRINT CAST(GETDATE() AS DateTime2 (3))
;WITH Dupe
AS (
SELECT *, ROW_NUMBER() OVER
(PARTITION BY ISNULL(ColA,''), ISNULL(ColB,''), ISNULL(ColC,''), ISNULL(ColD,''), ISNULL(ColE,''), ISNULL(ColF,'')
ORDER BY ISNULL(ColA,''), ISNULL(ColB,''), ISNULL(ColC,''), ISNULL(ColD,''), ISNULL(ColE,''), ISNULL(ColF,'')) AS rn
FROM @transactions2
)
SELECT * FROM Dupe WHERE rn > 1 ORDER BY rn
PRINT CAST(GETDATE() AS DateTime2 (3))
Answer 2 (score: 0)
Joining to a subquery is a faster approach. It runs in 10 seconds.
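-- Return only the columns of the original table
select x.* from transactions x
join (
-- Derived-table columns must be named so the join can reference them
select Coalesce(ColA, '') as ColA,
Coalesce(ColB, '') as ColB,
Coalesce(ColC, '') as ColC
from transactions
group by Coalesce(ColA, ''),
Coalesce(ColB, ''),
Coalesce(ColC, '')
having count(*) > 1
) dups on
-- Coalesce the outer side too; otherwise rows containing NULL never match
dups.ColA = Coalesce(x.ColA, '') and
dups.ColB = Coalesce(x.ColB, '') and
dups.ColC = Coalesce(x.ColC, '')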
The important point about this query is that it returns both (or all) rows of each duplicate group, not only the extra copies.
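Another way to get every member of each duplicate group, without a self-join, is a windowed count. This is only a sketch on made-up data: the table variable @demo and the CTE name Counted are hypothetical and do not come from the thread. Whether it beats the join on a large table depends on the data and indexes, so compare the execution plans before relying on it.

```sql
-- Hypothetical sample data; @demo is a made-up stand-in for the transactions table.
DECLARE @demo TABLE (
    ColA VARCHAR(10) NULL,
    ColB VARCHAR(10) NULL,
    ColC VARCHAR(10) NULL
);

INSERT INTO @demo (ColA, ColB, ColC) VALUES
    ('a', NULL, 'c'),
    ('a', '',   'c'),
    ('x', 'y',  'z');

-- Count the rows in each group of matching values, treating NULL and '' as equal.
WITH Counted AS (
    SELECT ColA, ColB, ColC,
           COUNT(*) OVER (
               PARTITION BY COALESCE(ColA, ''), COALESCE(ColB, ''), COALESCE(ColC, '')
           ) AS Cnt
    FROM @demo
)
-- Keep every row that belongs to a group with more than one member.
SELECT ColA, ColB, ColC, Cnt
FROM Counted
WHERE Cnt > 1;
```

On this sample the two 'a' rows come back with their original values (one with NULL in ColB, one with ''), and the 'x' row is filtered out.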
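Answer 3 (score: -1)
If this is a one-off job involving a large number of rows, rather than something used as a view, you might choose to INSERT ... SELECT the data into a table that has a UNIQUE index with the IGNORE_DUP_KEY option.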
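A minimal sketch of that idea follows, assuming three VARCHAR(50) columns. The temp tables #transactions_demo and #distinct_transactions and the index name UX_distinct_transactions are made up for illustration. Note that this produces a de-duplicated copy of the data rather than a list of the duplicates, and that the index key is subject to SQL Server's key size limit, so it may not suit many wide varchar columns.

```sql
-- Hypothetical stand-in for the real transactions table.
CREATE TABLE #transactions_demo (
    ColA VARCHAR(50) NULL,
    ColB VARCHAR(50) NULL,
    ColC VARCHAR(50) NULL
);

INSERT INTO #transactions_demo (ColA, ColB, ColC) VALUES
    ('a', NULL, 'c'),
    ('a', '',   'c'),
    ('x', 'y',  'z');

-- Target table: columns are NOT NULL because NULLs are folded into '' on insert.
CREATE TABLE #distinct_transactions (
    ColA VARCHAR(50) NOT NULL,
    ColB VARCHAR(50) NOT NULL,
    ColC VARCHAR(50) NOT NULL
);

-- With IGNORE_DUP_KEY = ON, rows that violate the unique index are discarded
-- with a warning instead of failing the whole INSERT.
CREATE UNIQUE INDEX UX_distinct_transactions
    ON #distinct_transactions (ColA, ColB, ColC)
    WITH (IGNORE_DUP_KEY = ON);

INSERT INTO #distinct_transactions (ColA, ColB, ColC)
SELECT COALESCE(ColA, ''), COALESCE(ColB, ''), COALESCE(ColC, '')
FROM #transactions_demo;

-- One row per distinct combination remains.
SELECT ColA, ColB, ColC FROM #distinct_transactions;

DROP TABLE #distinct_transactions;
DROP TABLE #transactions_demo;
```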