Flagging duplicates in a database

Time: 2012-07-12 16:05:10

Tags: sql tsql

I created a table with the following fields:

Record:
Id                 int Primary Key, Auto Increment
ForeignId          int
IsDuplicateRecord bit NULL

Then I inserted some data:

INSERT INTO Record (ForeignId)
VALUES (5), (5), (1), (2), (3)

After that, I ran the following update statement (found at http://archive.msdn.microsoft.com/SQLExamples/Wiki/View.aspx?title=DuplicateRows):

UPDATE Record
SET IsDuplicateRecord = 1
WHERE Id IN (
    SELECT MAX(Id)
    FROM Record
    GROUP BY ForeignId
    HAVING COUNT(*) > 1
)

At this point the query had affected one row, and the table looked like this:

Id ForeignId IsDuplicateRecord
0  5         NULL
1  5         1
2  1         NULL
3  2         NULL
4  3         NULL

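I was happy, because for a moment I thought everything was going to be all right. But then a doubt as dark as the clouds outside crept into my mind. Dismayed, I typed: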

INSERT INTO Record (ForeignId)
VALUES (1), (1)

I ran the query above again, and this time it produced:

Id  ForeignId  IsDuplicateRecord
0   5          NULL
1   5          1
2   1          NULL
3   2          NULL
4   3          NULL
5   1          NULL
6   1          1

So I thought I'd come to StackOverflow and see who can explain to me why the IsDuplicateRecord field in the row with Id 5 was not updated to 1. Are you the one?

3 Answers:

Answer 0: (score: 5)

Because the SQL you ran only flags the last row of each group of duplicates (the one with MAX(Id)). Try this:

UPDATE Record
SET IsDuplicateRecord = 1
WHERE Id NOT IN (
    SELECT MIN(Id)
    FROM Record
    GROUP BY ForeignId
)

This flags the second and every subsequent occurrence of each ForeignId as a duplicate, which I assume is what you want.
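To see the difference between the two statements on the question's seven rows, here is a minimal sketch. It assumes an in-memory SQLite database (via Python's sqlite3 module) as a stand-in for SQL Server, and the helper name flag_duplicates is made up for illustration. Ids are inserted explicitly so they match the 0-based Ids shown in the question.

```python
import sqlite3

def flag_duplicates(update_sql):
    """Build the question's seven-row Record table, run update_sql,
    and return the Ids that ended up flagged."""
    con = sqlite3.connect(":memory:")  # assumption: SQLite instead of SQL Server
    con.execute(
        "CREATE TABLE Record ("
        " Id INTEGER PRIMARY KEY,"
        " ForeignId INTEGER,"
        " IsDuplicateRecord INTEGER NULL)"
    )
    # Explicit Ids so the rows line up with the tables in the question.
    rows = [(0, 5), (1, 5), (2, 1), (3, 2), (4, 3), (5, 1), (6, 1)]
    con.executemany("INSERT INTO Record (Id, ForeignId) VALUES (?, ?)", rows)
    con.execute(update_sql)
    flagged = [r[0] for r in con.execute(
        "SELECT Id FROM Record WHERE IsDuplicateRecord = 1 ORDER BY Id")]
    con.close()
    return flagged

# The statement from the question: only the highest Id per group is flagged.
MAX_VERSION = """
UPDATE Record SET IsDuplicateRecord = 1
WHERE Id IN (SELECT MAX(Id) FROM Record GROUP BY ForeignId HAVING COUNT(*) > 1)
"""

# The statement from this answer: everything except the lowest Id per group.
MIN_VERSION = """
UPDATE Record SET IsDuplicateRecord = 1
WHERE Id NOT IN (SELECT MIN(Id) FROM Record GROUP BY ForeignId)
"""

flagged_by_max = flag_duplicates(MAX_VERSION)
flagged_by_min = flag_duplicates(MIN_VERSION)
print(flagged_by_max)  # [1, 6]
print(flagged_by_min)  # [1, 5, 6]
```

With ForeignId 1 appearing three times (Ids 2, 5, 6), the MAX(Id) version picks only Id 6, which is why Id 5 stays NULL in the question.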

Answer 1: (score: 1)

UPDATE Record uu
SET IsDuplicateRecord = 1
   -- if there exists a record with the same foreignid
   -- but a lower id
   -- this (uu) is a duplicate
WHERE EXISTS (
    SELECT *
    FROM Record ex 
    WHERE ex.ForeignId = uu.ForeignId
    AND ex.Id < uu.Id
    );

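There is a subtle (but nasty) difference between the EXISTS (...) subquery and @DavidM's NOT IN (...) version: the NOT IN subquery never yields NULL (it returns MIN(Id), and Id is the primary key), while GROUP BY collects all NULL ForeignId values into a single group. So if ForeignId happens to be NULL, the NOT IN condition is true for every such row except the one with the lowest Id, and the IsDuplicateRecord flag gets set for those rows. The EXISTS version never flags a row whose ForeignId IS NULL, because ex.ForeignId = uu.ForeignId is not true when either side is NULL. (I suspect ForeignId is an FK, so it may well be NULLable.)

For a non-nullable ForeignId, the two versions are essentially equivalent.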
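The NULL behaviour can be checked with a small sketch. It assumes an in-memory SQLite database (Python's sqlite3 module) rather than PostgreSQL or SQL Server, uses the same seven foreign_id values as the zrecord test set in this answer, and the helper name flagged_ids is hypothetical. Because the inner table is aliased as ex, the unaliased name zrecord inside the subquery refers to the row being updated.

```python
import sqlite3

def flagged_ids(update_sql):
    """Load the NULL-heavy test set, run update_sql, return flagged ids."""
    con = sqlite3.connect(":memory:")  # assumption: SQLite as a stand-in
    con.execute(
        "CREATE TABLE zrecord ("
        " id INTEGER PRIMARY KEY,"
        " foreign_id INTEGER,"
        " is_duplicate INTEGER DEFAULT 0)"
    )
    values = [None, 1, None, 1, None, 2, None]  # ids 1..7
    con.executemany("INSERT INTO zrecord (id, foreign_id) VALUES (?, ?)",
                    list(enumerate(values, start=1)))
    con.execute(update_sql)
    ids = [r[0] for r in con.execute(
        "SELECT id FROM zrecord WHERE is_duplicate = 1 ORDER BY id")]
    con.close()
    return ids

# GROUP BY puts all NULLs in one group, so only id 1 of the NULL rows is spared.
NOT_IN_VERSION = """
UPDATE zrecord SET is_duplicate = 1
WHERE id NOT IN (SELECT MIN(id) FROM zrecord GROUP BY foreign_id)
"""

# NULL = NULL is not true, so rows with a NULL foreign_id are never matched.
EXISTS_VERSION = """
UPDATE zrecord SET is_duplicate = 1
WHERE EXISTS (
    SELECT 1 FROM zrecord AS ex
    WHERE ex.foreign_id = zrecord.foreign_id
      AND ex.id < zrecord.id)
"""

not_in_flagged = flagged_ids(NOT_IN_VERSION)
exists_flagged = flagged_ids(EXISTS_VERSION)
print(not_in_flagged)  # [3, 4, 5, 7]
print(exists_flagged)  # [4]
```

Which behaviour is right depends on whether rows with an unknown ForeignId should count as duplicates of each other.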

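Update: as @MartinSmith pointed out, some implementations do not like UPDATE ... WHERE without a FROM clause. We can work around that with a dummy self-join. (I also fixed up the first query.)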

-- DROP SCHEMA tmp CASCADE;
-- CREATE SCHEMA tmp ;
-- SET search_path='tmp';

DROP TABLE zrecord CASCADE;
CREATE TABLE zrecord
        ( id SERIAL NOT NULL PRIMARY KEY
        , foreign_id INTEGER -- REFERENCES zrecord(id)
        , is_duplicate boolean DEFAULT False
        );
SELECT * FROM zrecord;

INSERT INTO zrecord(foreign_id) VALUES(NULL),(1),(NULL),(1),(NULL),(2),(NULL);

SELECT * FROM zrecord;

EXPLAIN ANALYZE
UPDATE zrecord uu
SET is_duplicate = True
        --
        -- This selfjoin is needed if UPDATE ... WHERE needs a FROM TABLE
        --
FROM zrecord dum
WHERE  dum.id = uu.id
AND EXISTS (
    SELECT *
    FROM zrecord ex
    WHERE ex.foreign_id = uu.foreign_id
    AND ex.Id < uu.Id
    );

SELECT * FROM zrecord;

UPDATE2: it seems that PARTITION BY suffers from the same nullability issue as the IN clause:

WITH zcte AS (
    SELECT *
    , row_number() OVER (PARTITION BY foreign_id ORDER BY id) AS rn
    FROM   zrecord
    )
SELECT * FROM zcte;

Result: (original test set, before any update)

 id | foreign_id | is_duplicate | rn 
----+------------+--------------+----
  2 |          1 | f            |  1
  4 |          1 | t            |  2
  6 |          2 | f            |  1
  1 |            | f            |  1
  3 |            | f            |  2
  5 |            | f            |  3
  7 |            | f            |  4
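The row numbers above can be reproduced with the following sketch. It assumes an in-memory SQLite database (Python's sqlite3 module) built against SQLite 3.25 or later, since older versions have no window functions; the names rn_by_id and would_be_flagged are only illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # assumption: SQLite >= 3.25 as a stand-in
con.execute("CREATE TABLE zrecord (id INTEGER PRIMARY KEY, foreign_id INTEGER)")
values = [None, 1, None, 1, None, 2, None]  # same test set, ids 1..7
con.executemany("INSERT INTO zrecord (id, foreign_id) VALUES (?, ?)",
                list(enumerate(values, start=1)))

# PARTITION BY treats all NULL foreign_id values as one partition.
rn_by_id = dict(con.execute(
    "SELECT id, ROW_NUMBER() OVER (PARTITION BY foreign_id ORDER BY id) AS rn"
    " FROM zrecord"))
con.close()

# Rows a "WHERE rn > 1" filter would flag as duplicates.
would_be_flagged = sorted(i for i, rn in rn_by_id.items() if rn > 1)
print(would_be_flagged)  # [3, 4, 5, 7]
```

The NULL rows 3, 5 and 7 get rn > 1, so a row_number-based update would flag them just like the NOT IN version does.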

Answer 2: (score: 0)

The estimated cost of this version is lower than that of the other two answers:

;WITH CTE
     AS (SELECT *,
                Row_number() OVER (PARTITION BY ForeignId ORDER BY Id) AS RN
         FROM   Record)
UPDATE CTE
SET    IsDuplicateRecord = 1
WHERE  RN > 1 

Execution plans:

[Image: Plans]