Remove duplicates by multiple column criteria

Time: 2018-06-04 10:20:04

Tags: sql sql-server

I have the following table:

CREATE TABLE Test (
     ID INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
     FIRST VARCHAR(10) NOT NULL,
     SECOND VARCHAR(10) NOT NULL
)

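The table is filled with some duplicate data. The TestTarget table has the same structure and is populated with the following procedural algorithm: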

DECLARE @first varchar(10), @second varchar(10)
DECLARE c CURSOR FAST_FORWARD
  FOR
    SELECT first, second FROM Test ORDER BY id

OPEN c
FETCH NEXT FROM c INTO @first, @second
WHILE @@fetch_status = 0
  BEGIN
    IF NOT EXISTS(SELECT 1 FROM TestTarget WHERE first=@first OR second=@second)
      INSERT INTO TestTarget (first, second) VALUES(@first, @second)

    FETCH NEXT FROM c INTO @first, @second
  END
CLOSE c
DEALLOCATE c 

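In short, before each insert we check whether the target table already contains the same 'first' OR the same 'second' value, and we insert the row only if it contains neither.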
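One consequence of that check is that no 'first' value and no 'second' value can appear twice in TestTarget. A minimal sketch of a sanity query for any candidate replacement of the cursor follows; the aliases `col`, `val` and `cnt` are only illustrative names. An empty result is necessary but not sufficient, because it does not verify that the rows with the lowest ids were the ones chosen.

```sql
-- Lists every value that occurs more than once in its column of TestTarget.
-- The cursor logic above should always leave this result empty.
SELECT 'first' AS col, first AS val, COUNT(*) AS cnt
FROM TestTarget
GROUP BY first
HAVING COUNT(*) > 1
UNION ALL
SELECT 'second' AS col, second AS val, COUNT(*) AS cnt
FROM TestTarget
GROUP BY second
HAVING COUNT(*) > 1;
```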

Example:

Source table

ID  FIRST SECOND
1   A     2
2   A     1
3   A     3
4   B     2
5   B     1
6   B     3
7   B     2
8   B     4
9   C     2
10  C     3

INSERT INTO Test (first, second)
  VALUES ('A', '2'),
         ('A', '1'),
         ('A', '3'),
         ('B', '2'),
         ('B', '1'),
         ('B', '3'),
         ('B', '2'),
         ('B', '4'),
         ('C', '2'),
         ('C', '3')

Target table

ID  FIRST SECOND
1   A     2
5   B     1
10  C     3

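The real source table has x * 100k rows, and at least 2 rows share the same 'first' or 'second' value. I am looking for a set-based solution if one is possible, or at least something faster than this loop, because my real case takes several hours.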

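NOTE: Classic duplicate removal via partitioning/joining/etc. does not apply here, because it produces different results even when the final number of rows is the same.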
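To illustrate the note, here is a minimal sketch of one classic approach: keep the lowest id per 'first' value with ROW_NUMBER. The aliases `x` and `rn` are illustrative. On the sample data it returns ids 1, 4 and 9, while the cursor keeps ids 1, 5 and 10: the row count matches, but rows 4 and 9 reuse the 'second' value 2 already taken by row 1.

```sql
-- Classic de-duplication on a single column: first row (by id) per FIRST value.
-- It ignores the SECOND column, so it cannot reproduce the cursor's result.
SELECT id, first, second
FROM (
    SELECT id, first, second,
           ROW_NUMBER() OVER (PARTITION BY first ORDER BY id) AS rn
    FROM Test
) AS x
WHERE rn = 1
ORDER BY id;
```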

3 Answers:

Answer 0: (score: 0)

INSERT INTO TestTarget (first, second)
SELECT first,second
FROM Test t
WHERE NOT EXISTS 
(
    SELECT 1 
    FROM Test t2 
    WHERE t2.id>t.id and (t2.first=t.first or t2.second=t.second)
)

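Answer 1: (score: 0)

I'm afraid I can't think of a simple set-based solution, but I would expect something along the following lines to be considerably faster than the existing cursor: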

declare @test table
(id int,
first varchar(1),
second varchar(1))

declare @target table
(id int,
first varchar(1),
second varchar(1))

declare @temp table
(id int,
first varchar(1),
second varchar(1))

INSERT INTO @Test (id, first, second)
  VALUES (1, 'A', '2'),
         (2, 'A', '1'),
         (3, 'A', '3'),
         (4, 'B', '2'),
         (5, 'B', '1'),
         (6, 'B', '3'),
         (7, 'B', '2'),
         (8, 'B', '4'),
         (9, 'C', '2'),
         (10, 'C', '3')

declare @firsts table
(first varchar(1))

declare @seconds table
(second varchar(1))

INSERT INTO @firsts
SELECT DISTINCT first FROM @test

INSERT INTO @seconds
SELECT DISTINCT second FROM @test

declare @firstcnt int = (SELECT count(*) FROM @firsts)
declare @secondcnt int = (SELECT count(*) FROM @seconds)

WHILE (@firstcnt > 0 AND @secondcnt > 0)
BEGIN
    DELETE FROM @temp
    INSERT INTO @temp 
    SELECT TOP 1 t.id, t.first, t.second FROM @test t
    INNER JOIN @firsts f On t.first = f.first
    INNER JOIN @seconds s On t.second = s.second
    ORDER BY id
    INSERT INTO @target 
    SELECT * FROM @temp
    DELETE FROM @firsts WHERE first = (SELECT first FROM @temp)
    SET @firstcnt = @firstcnt - 1
    DELETE FROM @seconds WHERE second = (SELECT second FROM @temp)
    SET @secondcnt = @secondcnt - 1
END

SELECT * FROM @target

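This does produce the desired values, and I would expect it to be faster, because the WHILE loop only needs to run as many times as there are unique value pairs rather than having to traverse the whole table.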

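Despite @Gordon's comment, the last row also gives 10 C 3, which I believe is correct. If I have understood the question correctly, ID order takes precedence: although there are 'A' and 'B' entries with '3' as the second value, those entries have a higher id than another row with a different second value that can legitimately be inserted.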
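The same greedy idea could also be run directly against the Test and TestTarget tables from the question. The following is only a sketch, not a tuned implementation. It assumes all ids in Test are greater than 0 (as with IDENTITY(1,1)), and it would likely benefit from indexes on TestTarget(first) and TestTarget(second), which the question does not mention. `@lastId` is an illustrative variable: once a row is rejected it stays rejected, so each pass can resume after the last inserted id.

```sql
DECLARE @lastId INT = 0, @first VARCHAR(10), @second VARCHAR(10);

WHILE 1 = 1
BEGIN
    -- Lowest remaining id whose FIRST and SECOND values are both unused.
    SELECT TOP (1) @lastId = t.id, @first = t.first, @second = t.second
    FROM Test AS t
    WHERE t.id > @lastId
      AND NOT EXISTS (SELECT 1 FROM TestTarget AS x WHERE x.first = t.first)
      AND NOT EXISTS (SELECT 1 FROM TestTarget AS x WHERE x.second = t.second)
    ORDER BY t.id;

    -- No qualifying row left: stop.
    IF @@ROWCOUNT = 0 BREAK;

    INSERT INTO TestTarget (first, second) VALUES (@first, @second);
END;
```

The loop runs once per inserted row plus one final empty pass, so on the sample data it executes four times and inserts (A, 2), (B, 1) and (C, 3).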

HTH

Answer 2: (score: 0)

Using a recursive CTE:

declare @Target table(col1 varchar(20),col2 int)
declare @Test table(col1 varchar(20),col2 int)

INSERT INTO @Test (col1, col2)
VALUES ('A', '2'),
('A', '1'),
('A', '3'),
('B', '1'),
('B', '2'),
('B', '3'),
('B', '2'),
('B', '4'),
('C', '2'),
('C', '3')
     

;With CTE as
(
select col1 ,col2
,DENSE_RANK()over( ORDER by col1)rn1
from @Test
)

,cte1 AS(
select top 1 c.col1,c.col2,rn1 from cte c where rn1=1
union ALL
select c.col1,c.col2,c.rn1 from cte c
inner join cte1 c1
on c.rn1>c1.rn1
where c.col2!=c1.col2
)

insert into @Target

select col1,col2 FROM(
select *,ROW_NUMBER()over(partition by col1 order by (select null)) rn2 from cte1

)t4
where rn2=1

select * from @Target