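How do I delete duplicate rows in SQL 2000?

Asked: 2010-04-21 21:36:09

Tags: sql sql-server-2000

I thought I had this figured out, but it turns out I am only deleting the first record. The following query returns the duplicate rows. Every group has a count of 2. I just want to delete the first row of each duplicated record.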

select scorestudentid, scoreadvisor, scorecorrect, count(*) 
from scores
where scoretestid = 3284
group by scorestudentid, scoreadvisor, scorecorrect
having count(scorestudentid) > 1

Returns:

scorestudentid  scoreadvisor  scorecorrect  (No column name)
13033719        28059         3.0           2
13033777        28086         3.0           2
13033826        28147         3.0           2
13033960        28023         3.0           2

So I put this together, thinking it would work:

set rowcount 1
delete
from scores
where scoretestid = 3284 
and scorestudentid in (
    select scorestudentid
    from scores
    where scoretestid = 3284
    group by scorestudentid
    having count(scorestudentid) > 1)

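It seems like it should be a simple concept, but I am not getting it.

Based on Thomas's script, I updated the query to fit my table, but it still does not work.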

Delete Scores
Where Exists    (
                Select 1
                From Scores As S2
                Where S2.ScoreStudentId = Scores.ScoreStudentId
                        And S2.ScoreAdvisor = Scores.ScoreAdvisor
                        And S2.ScoreCorrect = Scores.ScoreCorrect
                Group By S2.ScoreStudentId, S2.ScoreAdvisor, S2.ScoreCorrect
                Having Count(*) > 1
                    And Min(S2.NewScoreID) = Scores.NewScoreID
                )
    And Scores.ScoreTestId = 3284

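3 Answers:

Answer 0: (score: 5)

The trick is to use the primary key column (you do have one, correct?) and simply find the first PK value that matches the criteria you want. If for some crazy reason you do not have a primary key column, add an Identity column, make it the primary key, and then do the delete.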
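For a table that has no key, adding one might look like the minimal sketch below. The column name NewScoreID is borrowed from the asker's updated query; the constraint name PK_Scores is made up for illustration, and the sketch assumes Scores has neither a primary key nor an identity column yet.

```sql
-- Assumption: Scores has no primary key and no identity column yet.
-- Add an auto-numbered column; existing rows receive values automatically.
ALTER TABLE Scores
    ADD NewScoreID int IDENTITY(1, 1) NOT NULL
GO

-- PK_Scores is a hypothetical constraint name chosen for this sketch.
ALTER TABLE Scores
    ADD CONSTRAINT PK_Scores PRIMARY KEY (NewScoreID)
GO
```

GO is the batch separator recognized by Query Analyzer; it keeps the second statement from being compiled before the new column exists.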

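EDIT: Revised to make it more generic. If you remove the final filter on the test ID, it will remove duplicates across the whole table based on ScoreStudentId, ScoreAdvisor and ScoreCorrect.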

Delete Scores
Where Exists    (
                Select 1
                From Scores As S2
                Where S2.ScoreStudentId = Scores.ScoreStudentId
                        And S2.ScoreAdvisor = Scores.ScoreAdvisor
                        And S2.ScoreCorrect = Scores.ScoreCorrect
                Group By S2.ScoreStudentId, S2.ScoreAdvisor, S2.ScoreCorrect
                Having Count(*) > 1
                    And Min(S2.PrimaryKeyColumn) = Scores.PrimaryKeyColumn
                )
    And Scores.ScoreTestId = 3284
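If a group can hold three or more copies, a query of the shape above only targets the row with the lowest key in each group. One possible variation, sketched here on the assumption that the key column is the NewScoreID from the asker's updated query and that the filter column is ScoreTestId, deletes every row for which a lower-keyed copy exists, so only the lowest-keyed row of each group survives:

```sql
-- Sketch only: keeps the row with the lowest NewScoreID in each group
-- and deletes all other copies, however many there are.
-- Assumptions: NewScoreID is a unique key; the compared columns are NOT NULL
-- (rows with NULLs in them never compare equal and would be left alone).
DELETE FROM Scores
WHERE ScoreTestId = 3284
  AND EXISTS (
        SELECT 1
        FROM Scores AS S2
        WHERE S2.ScoreTestId    = Scores.ScoreTestId
          AND S2.ScoreStudentId = Scores.ScoreStudentId
          AND S2.ScoreAdvisor   = Scores.ScoreAdvisor
          AND S2.ScoreCorrect   = Scores.ScoreCorrect
          AND S2.NewScoreID     < Scores.NewScoreID   -- a lower-keyed copy exists
      )
```

Note that this keeps the first row rather than deleting it. For groups of exactly two, the number of rows left is the same either way.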

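Answer 1: (score: 0)

I believe Thomas's solution will not work when the primary key is a uniqueidentifier. Also, if a record is repeated more than twice in the table (i.e. 3, 4, 5 times), only one copy will be deleted.

This is what we use:

declare @col1 uniqueidentifier
declare @col2 varchar(256)
declare @col3 datetime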

DECLARE C CURSOR
FOR

            select col1, col2, col3
            from MyTable
            where IsDeleted = 0
            group by col1, col2, col3
            having count(*) > 1
OPEN    C
FETCH NEXT FROM C
INTO    @col1, @col2, @col3

WHILE @@FETCH_STATUS = 0
BEGIN

declare @primaryKey uniqueidentifier
set @primaryKey = (select top 1 primaryKey from MyTable
                            where col1 = @col1 and col2= @col2 and col3 = @col3)

update MyTable
set IsDeleted = 1, DeleteDt = GETDATE()
where col1 = @col1
    and col2 = @col2
    and col3 = @col3
    and PrimaryKey<> @primaryKey


FETCH NEXT FROM C
INTO    @col1, @col2, @col3
END

CLOSE C
DEALLOCATE C

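What this cursor does:

  • Selects every combination of values that has duplicates
  • For each set of duplicate rows:
      • Gets the primary key of one of the rows in the set
      • Logically deletes all the other rows in the set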
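The same soft delete can probably be written without a cursor. The sketch below reuses the placeholder names from the cursor version (MyTable, col1 to col3, PrimaryKey, IsDeleted, DeleteDt) and assumes the key column can be compared with <. Unlike the cursor, it always keeps the lowest-keyed row of each set rather than an arbitrary one, and it only considers rows that are not already flagged as deleted.

```sql
-- Sketch only: flag every duplicate except the lowest-keyed live row.
-- Assumption: PrimaryKey supports the < comparison and col1..col3 are NOT NULL.
UPDATE MyTable
SET IsDeleted = 1,
    DeleteDt  = GETDATE()
WHERE IsDeleted = 0
  AND EXISTS (
        SELECT 1
        FROM MyTable AS M2
        WHERE M2.col1 = MyTable.col1
          AND M2.col2 = MyTable.col2
          AND M2.col3 = MyTable.col3
          AND M2.IsDeleted = 0
          AND M2.PrimaryKey < MyTable.PrimaryKey   -- a lower-keyed live copy exists
      )
```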

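Answer 2: (score: 0)

I am going to discuss an interesting topic in the SQL world. If you google it, you will find multiple ways to delete duplicate data from a table. I am not going to write anything very new, but I will discuss the performance problem with the traditional way of deleting duplicate data.

Deleting duplicate rows in SQL 2000: I created a table, DuplicateData, and inserted several duplicate rows based on EmpId.

create table DuplicateData(EmpId int, Name varchar(100)) --> table creation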

insert into DuplicateData values(4,'Akshay')
insert into DuplicateData values(4,'Akshay')
insert into DuplicateData values(5,'ankit')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(2,'Raj')
insert into DuplicateData values(2,'Raj')
insert into DuplicateData values(1,'Neeraj')
insert into DuplicateData values(1,'Neeraj')

insert into DuplicateData values(1,'Neeraj')

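The traditional way to delete duplicate rows from a table in SQL 2000: if we run the batch below in Query Analyzer, it deletes all duplicate values from the DuplicateData table. The query is OK if you are running it in a test environment or on dummy data. If you have millions of records, however, it will be the worst query in terms of performance. It can take hours or even days, depending on the volume of data in the target table.

Reason: the query below uses a correlated subquery, which is evaluated for each EmpId in the table to check whether the count for that EmpId is greater than 1, and the rows are then deleted one at a time. That is why it performs so poorly.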

set rowcount 1
delete from DuplicateData where (select count(EmpId) from DuplicateData a where a.EmpId=DuplicateData.EmpId)>1
while @@rowcount>0
delete from DuplicateData where (select count(EmpId) from DuplicateData a where a.EmpId=DuplicateData.EmpId)>1

set rowcount 0

We can create a stored procedure to overcome this performance problem. Below is an example.

declare @tmp table(empid int,cnt int, rowid int identity)--> declare table variable

declare @maxcounter as integer--> Declaration of variables
declare @mincounter as integer
declare @rowcnt as integer
declare @empid as int-->End of Declaration

insert into @tmp(empid,cnt)-->Inserting duplicate empid along with no of duplicate entries
select empid,count(empid) from duplicatedata 
group by empid having count(empid)>1

select @mincounter=min(rowid),@maxcounter=max(rowid) from @tmp -->assigning minimum and maximum rowid to variables.

while @mincounter <=@maxcounter
begin
 select @rowcnt=cnt,@empid=empid from @tmp where rowid=@mincounter 
 set @rowcnt =@rowcnt-1
 set rowcount @rowcnt
 delete from duplicatedata where empid=@empid
 set rowcount 0
 set @mincounter=@mincounter +1
end

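Let us walk through the while loop above. The @tmp table holds every duplicated EmpId along with its number of duplicate entries. We loop over each record in @tmp, which is why the minimum and maximum rowid were assigned to the variables @mincounter and @maxcounter.

In the body of the while loop, we assign the number of duplicate records to the variable @rowcnt and the EmpId to the variable @empid.

In the next statement we set @rowcnt = @rowcnt - 1. We do this because the variable holds the number of duplicate records for that EmpId, but we want to keep one record for each duplicated EmpId. The statement after that sets rowcount to this value, one less than the number of duplicate records for that EmpId, so the delete that follows removes all but one of them.

The next statement resets rowcount to 0, and the last statement increments @mincounter to fetch the next record from @tmp.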
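To check the result after the loop has run, a couple of queries along these lines could be used. This is only a sanity-check sketch against the sample DuplicateData table above; the column alias Copies is made up for illustration.

```sql
-- Expect no rows: every EmpId should now appear exactly once.
select EmpId, count(*) as Copies
from DuplicateData
group by EmpId
having count(*) > 1

-- With the sample inserts above, five rows should remain (EmpId 1 to 5).
select EmpId, Name
from DuplicateData
order by EmpId
```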