Delete duplicates: simplest method / explanation

Date: 2012-08-13 00:27:10

Tags: sql sql-server-2008

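Let me start by saying that I (as a newbie) did search several Q&As about duplicates in tables, but unfortunately I could not adapt the code given in the answers.

My table is generated by a report and sorted in SQL Server 2008.

I would like to know how to delete the duplicate records, with an explanation.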

"MyTable":

Column1   (PK-auto incremental table's record ID) 
Column2   (some TXT) 
Column3   (Some TXT)
Column4   (SmallDateTime)
Column5   is empty 

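Column5 will hold the value of SUM(count of deleted duplicates including this survived row).

A possible key to the solution: when several records have the same content in [column2 and column3] (and are therefore duplicates), they do not always share the same date (column4).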

From this:

col1  col2   col3  col4         col5
----  -----  ----  -----------  ----
1     [abc]  [4]   [10/1/2012]  null
2     [abc]  [1]   [12/1/2012]  null
3     [ghi]  [6]   [4/1/2012]   null
4     [def]  [5]   [8/1/2012]   null
5     [abc]  [4]   [10/1/2012]  null
6     [def]  [5]   [12/1/2012]  null
7     [ghi]  [6]   [15/1/2012]  null
8     [abc]  [4]   [17/1/2012]  null
9     [ghi]  [6]   [6/1/2012]   null
10    [abc]  [1]   [13/1/2012]  null

Into this:

col1  col2   col3  col4         col5
----  -----  ----  -----------  ----
8     [abc]  [4]   [17/1/2012]  3
10    [abc]  [1]   [13/1/2012]  2
6     [def]  [5]   [12/1/2012]  2
7     [ghi]  [6]   [15/1/2012]  3

Meaning: keep the latest one (1) as the representative of each set of duplicate records.
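For anyone who wants to try the answers on the sample data, here is a minimal setup sketch. The column types and lengths are assumptions guessed from the description (the question only says "some TXT" and SmallDateTime), and the dates are read as day/month/year because values such as 15/1/2012 cannot be month/day. The answers use both dbo.table_name and MyTable; the name dbo.MyTable here is just one choice.

```sql
-- Hypothetical DDL: types and lengths are assumptions, not taken from the question.
CREATE TABLE dbo.MyTable
(
  col1 INT IDENTITY(1,1) PRIMARY KEY,  -- auto-incrementing record ID
  col2 VARCHAR(50),                    -- some text
  col3 VARCHAR(50),                    -- some text
  col4 SMALLDATETIME,
  col5 INT NULL                        -- empty until the duplicate count is written
);

-- Insert explicit IDs so the rows match the sample exactly.
SET IDENTITY_INSERT dbo.MyTable ON;

-- Dates are written as YYYYMMDD, which SQL Server reads the same way
-- regardless of language or DATEFORMAT settings.
INSERT dbo.MyTable (col1, col2, col3, col4) VALUES
  (1,  'abc', '4', '20120110'),
  (2,  'abc', '1', '20120112'),
  (3,  'ghi', '6', '20120104'),
  (4,  'def', '5', '20120108'),
  (5,  'abc', '4', '20120110'),
  (6,  'def', '5', '20120112'),
  (7,  'ghi', '6', '20120115'),
  (8,  'abc', '4', '20120117'),
  (9,  'ghi', '6', '20120106'),
  (10, 'abc', '1', '20120113');

SET IDENTITY_INSERT dbo.MyTable OFF;
```

Running this against a scratch database rather than the real table makes it safe to experiment with the update and delete statements.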

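++ Re-edit ++

Aaron Bertrand, shawnt00, e2nburner ... and the rest of you: I can't say how much I appreciate your replies, although I haven't yet understood that much code. I'm going to go through the code now, but not before thanking you!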

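When I first started programming and needed a SQL query, after using

Select * From MyTable

... my first SQL statement ...

I said, "I know SQL!" ... and now ... looking at the depth of knowledge you all have ... thanks a lot. I know this post on StackOverflow will be useful to other beginners too.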

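3 answers:

Answer 0: (score: 2)

This answer uses a common table expression (CTE), with row_number() and count() applied to each "slice" of the data (meaning each group of rows sharing col2 + col3). count() identifies how many rows each such group has, and row_number() applies a "rank" ordered by col4 desc (1 = newest in each group, 2 = second newest, and so on). It also uses col1 (which looks like a unique column) to break any ties.

A CTE can be followed by a query such as a select, update, or delete. So you can run the select first to verify that these are the rows you want to keep and that the counts are correct. If they are, you can go ahead with the update and the delete. You'll notice that in all cases the row_number() output is what identifies the rows you keep or the rows you discard.

Identify the rows to keep: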

;WITH n AS 
(
  SELECT col1, col2, col3, col4, 
    c = COUNT(*) OVER (PARTITION BY col2, col3),
    rn = ROW_NUMBER() OVER 
    (
      PARTITION BY col2, col3 ORDER BY col4 DESC, col1 DESC
    )
  FROM dbo.table_name
)
SELECT col1, col2, col3, col4, c
  FROM n WHERE rn = 1;
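To see how the ranks are assigned before any filtering, it may help to run the inner query on its own and sort by group. This is only an inspection sketch reusing the same window expressions as the query above; it changes nothing in the table.

```sql
-- Show every row with its group size (c) and its rank within the group (rn).
SELECT col1, col2, col3, col4,
  c  = COUNT(*) OVER (PARTITION BY col2, col3),
  rn = ROW_NUMBER() OVER
  (
    PARTITION BY col2, col3 ORDER BY col4 DESC, col1 DESC
  )
FROM dbo.table_name
ORDER BY col2, col3, rn;
```

With the sample data from the question, the (abc, 4) group has c = 3: col1 = 8 gets rn = 1 because it has the newest date, and the two rows dated 10/1/2012 are ordered by the col1 tie-breaker, so col1 = 5 gets rn = 2 and col1 = 1 gets rn = 3.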

Once you have confirmed that these are the rows you want to keep, you can update them like this:

;WITH n AS 
(
  SELECT col1, col2, col3, col4, col5, 
    c = COUNT(*) OVER (PARTITION BY col2, col3),
    rn = ROW_NUMBER() OVER 
    (
      PARTITION BY col2, col3 ORDER BY col4 DESC, col1 DESC
    )
  FROM dbo.table_name
)
UPDATE n SET col5 = c
  WHERE rn = 1;

Then delete the rest this way:

;WITH n AS 
(
  SELECT col1, col2, col3, col4, 
    rn = ROW_NUMBER() OVER 
    (
      PARTITION BY col2, col3 ORDER BY col4 DESC, col1 DESC
    )
  FROM dbo.table_name
)
DELETE n WHERE rn > 1;

Or even more simply (assuming col5 was entirely NULL before the update):

DELETE dbo.table_name WHERE col5 IS NULL;
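Since the update and the delete only make sense as a pair, one cautious way to run them is inside an explicit transaction, so the result can be inspected before it becomes permanent. This is a sketch of that idea, not part of the original answer; it reuses the statements above and still assumes col5 is entirely NULL beforehand.

```sql
BEGIN TRANSACTION;

-- Step 1: write the group size into col5 on the newest row of each group.
;WITH n AS
(
  SELECT col5,
    c = COUNT(*) OVER (PARTITION BY col2, col3),
    rn = ROW_NUMBER() OVER
    (
      PARTITION BY col2, col3 ORDER BY col4 DESC, col1 DESC
    )
  FROM dbo.table_name
)
UPDATE n SET col5 = c
  WHERE rn = 1;

-- Step 2: every row that did not receive a count is a duplicate.
DELETE dbo.table_name WHERE col5 IS NULL;

-- Step 3: look at what is left before deciding.
SELECT col1, col2, col3, col4, col5
  FROM dbo.table_name
  ORDER BY col2, col3;

-- Run exactly one of these once the output has been checked:
-- COMMIT TRANSACTION;
-- ROLLBACK TRANSACTION;
```

Keep in mind that the transaction holds locks on the table until it is committed or rolled back, so it should not be left open for long on a busy system.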

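Answer 1: (score: 1)

Here's a simple way. You may find merge better. These versions keep the row with the highest col1 value and set its date column to the group's max date. Aaron kept the row that already has the max date. I doubt the difference is important, but it should be noted.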

update MyTable
set col4 = (
    select max(col4)
    from MyTable as m2
    where m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
),  col5 = (
    select count(*)
    from MyTable as m2
    where m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
)
where not exists (
    select *
    from MyTable as m2
    where
        m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
        and m2.col1 > MyTable.col1
);

delete from MyTable
where exists (
    select *
    from MyTable as m2
    where
        m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
        and m2.col1 > MyTable.col1
);

Edit 2: here is a shot at a merge query

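-- One source row per (col2, col3) group: highest id, latest date, row count.
-- rowcount is a reserved word in T-SQL, so the count is aliased as cnt.
merge MyTable as target
using (
    select max(col1), col2, col3, max(col4), count(*)
    from MyTable
    group by col2, col3
) as source(id, col2, col3, maxdate, cnt)
on (
        target.col1 = source.id
    and target.col2 = source.col2
    and target.col3 = source.col3
)
when matched then
    update set col4 = source.maxdate, col5 = source.cnt
-- target rows with no matching source row are the duplicates
when not matched by source then delete;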

Edit 3: keep the row that has the original max date, breaking ties on col1

-- option #1
update MyTable
set col5 = (
    select count(*)
    from MyTable as m2
    where m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
)
where not exists (
    select *
    from MyTable as m2
    where
        m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
        and (m2.col4 > MyTable.col4 or (m2.col4 = MyTable.col4 and m2.col1 > MyTable.col1))
);

delete from MyTable
where exists (
    select *
    from MyTable as m2
    where
        m2.col2 = MyTable.col2 and m2.col3 = MyTable.col3
        and (m2.col4 > MyTable.col4 or (m2.col4 = MyTable.col4 and m2.col1 > MyTable.col1))
);

-- option #2
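-- The source picks the newest row per (col2, col3) with row_number().
-- Matching on max(col1) and max(col4) together would miss any group where
-- those two values sit on different rows, and the whole group would be deleted.
merge MyTable as target
using (
    select col1, cnt
    from (
        select col1,
            count(*) over (partition by col2, col3) as cnt,
            row_number() over (partition by col2, col3 order by col4 desc, col1 desc) as rn
        from MyTable
    ) as ranked
    where rn = 1
) as source(keepid, cnt)
on (target.col1 = source.keepid)
when matched then
    update set col5 = source.cnt
when not matched by source then delete;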
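Whichever variant is used, a quick check afterwards can confirm that no duplicates are left. This is a small verification sketch, not part of the original answer; it only assumes the same MyTable and column names.

```sql
-- Any (col2, col3) pair that still occurs more than once shows up here.
SELECT col2, col3, remaining = COUNT(*)
FROM MyTable
GROUP BY col2, col3
HAVING COUNT(*) > 1;
```

If the cleanup worked, the query returns no rows.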

Answer 2: (score: 0)

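-- Rank the rows in each (col2, col3) group, newest first.
-- Column names are adapted to the question's table; col1 is the ID column.
;WITH a AS (
    SELECT  col1,
            ROW_NUMBER() OVER (PARTITION BY col2, col3 ORDER BY col4 DESC, col1 DESC) AS RowNum
    FROM    MyTable
)
-- Every row that is not ranked first is a duplicate and gets deleted.
DELETE FROM MyTable
WHERE col1 IN (
    SELECT col1
    FROM   a
    WHERE  a.RowNum <> 1
);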