SQL "deduplication" using NULL as a wildcard

Time: 2018-05-27 01:55:43

Tags: sql amazon-redshift

Consider the following table, in which no column has a NOT NULL constraint:

  a   |  b   |  c   |  d
------+------+------+------
    3 |    5 |   12 | NULL
 NULL |    5 |   12 | NULL
   13 | NULL |   26 | NULL
 NULL | NULL |   26 |    4
    6 |    7 |    5 | NULL
    6 | NULL | NULL | NULL
    6 | NULL |    5 | NULL
    6 |    7 | NULL | NULL
 NULL | NULL | NULL | NULL
(9 rows)

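All nine rows are distinct, but if we treat NULL as a "wildcard", meaning it can take any value, then only the first, third, fourth, and fifth rows are definitely distinct. Since the non-NULL values of each of the other rows all appear in one of the definitely distinct rows, I want to delete those other rows to produce the following table: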

  a   |  b   | c  |  d
------+------+----+------
    3 |    5 | 12 | NULL
   13 | NULL | 26 | NULL
 NULL | NULL | 26 |    4
    6 |    7 |  5 | NULL
(4 rows)

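Given a table, how can I delete the rows whose values are a subset of the values of another row in the same table? Another way to ask this: how can I deduplicate a table using NULL as a wildcard?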

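Don't worry about removing actual duplicate rows (which is why I put "deduplication" in quotes in the title). In particular, I'd like to be able to do this in both PostgreSQL and Redshift.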
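To make the "subset" rule concrete, here is a minimal sketch in plain Python rather than SQL. It assumes `None` stands in for NULL, and the helper names `is_covered_by` and `wildcard_dedupe` are made up for illustration. A row is dropped when some different row agrees with it on every one of its non-NULL columns.

```python
# Sample rows from the question; None stands in for SQL NULL.
rows = [
    (3, 5, 12, None),
    (None, 5, 12, None),
    (13, None, 26, None),
    (None, None, 26, 4),
    (6, 7, 5, None),
    (6, None, None, None),
    (6, None, 5, None),
    (6, 7, None, None),
    (None, None, None, None),
]


def is_covered_by(row, other):
    """True if every non-NULL value in `row` appears in the same column of `other`."""
    return all(x is None or x == y for x, y in zip(row, other))


def wildcard_dedupe(rows):
    """Keep only rows that are not covered by any *different* row.

    Exact duplicates never eliminate each other here, in line with the
    note that removing real duplicates is out of scope.
    """
    return [
        r for r in rows
        if not any(o != r and is_covered_by(r, o) for o in rows)
    ]


result = wildcard_dedupe(rows)
for r in result:
    print(r)
```

On the sample data this keeps the first, third, fourth, and fifth rows, which is the desired table shown above.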

For reference, these statements create the original table above:

CREATE TABLE t (a int, b int, c int, d int);
INSERT INTO t
VALUES (   3,    5,   12, NULL),
       (NULL,    5,   12, NULL),
       (  13, NULL,   26, NULL),
       (NULL, NULL,   26,    4),
       (   6,    7,    5, NULL),
       (   6, NULL, NULL, NULL),
       (   6, NULL,    5, NULL),
       (   6,    7, NULL, NULL),
       (NULL, NULL, NULL, NULL);

4 Answers:

Answer 0: (score: 1)

To select only the rows that have no match under the NULL wildcard, use NOT EXISTS:

SELECT *
FROM T AS t
WHERE NOT EXISTS (
    SELECT 1
    FROM T AS dup
    WHERE (dup.a = t.a OR t.a IS NULL)
      AND (dup.b = t.b OR t.b IS NULL)
      AND (dup.c = t.c OR t.c IS NULL)
      AND (dup.d = t.d OR t.d IS NULL)
      AND CONCAT(dup.a,'-',dup.b,'-',dup.c,'-',dup.d) <> CONCAT(t.a,'-',t.b,'-',t.c,'-',t.d)
)

To select only the duplicates under the NULL wildcard, use EXISTS:

SELECT *
FROM T AS t
WHERE EXISTS (
    SELECT 1
    FROM T AS dup
    WHERE (dup.a = t.a OR t.a IS NULL)
      AND (dup.b = t.b OR t.b IS NULL)
      AND (dup.c = t.c OR t.c IS NULL)
      AND (dup.d = t.d OR t.d IS NULL)
      AND CONCAT(dup.a,'-',dup.b,'-',dup.c,'-',dup.d) <> CONCAT(t.a,'-',t.b,'-',t.c,'-',t.d)
)

To delete the duplicates under the NULL wildcard from the table, use EXISTS:

DELETE
FROM T AS t
WHERE EXISTS (
    SELECT 1
    FROM T AS dup
    WHERE (dup.a = t.a OR t.a IS NULL)
      AND (dup.b = t.b OR t.b IS NULL)
      AND (dup.c = t.c OR t.c IS NULL)
      AND (dup.d = t.d OR t.d IS NULL)
      AND CONCAT(dup.a,'-',dup.b,'-',dup.c,'-',dup.d) <> CONCAT(t.a,'-',t.b,'-',t.c,'-',t.d)
)

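Note that, because of the comparison on CONCAT, records that are exact duplicates of each other are not treated as duplicates.

If the table had an ID as its primary key, the CONCAT comparison could be replaced with

AND dup.ID <> t.ID

but then exact duplicates would also be treated as duplicates.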
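The effect of the ID variant on exact duplicates can be checked with a small sketch. It makes two assumptions: it uses Python's built-in `sqlite3` module instead of PostgreSQL or Redshift, and it uses a hypothetical table `t_with_id` that adds an `id` primary key to the question's columns. The predicate is the EXISTS condition from above with `dup.id <> t.id` in place of the CONCAT comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the question's columns plus an id primary key.
conn.execute(
    "CREATE TABLE t_with_id (id INTEGER PRIMARY KEY, a INT, b INT, c INT, d INT)"
)
conn.executemany(
    "INSERT INTO t_with_id (id, a, b, c, d) VALUES (?, ?, ?, ?, ?)",
    [
        (1, 3, 5, 12, None),      # exact duplicate of id 2
        (2, 3, 5, 12, None),      # exact duplicate of id 1
        (3, None, 5, 12, None),   # covered by ids 1 and 2
        (4, 13, None, 26, None),  # not covered by any other row
    ],
)

# Rows the EXISTS predicate would flag when rows are told apart by id.
flagged = [
    row[0]
    for row in conn.execute(
        """
        SELECT t.id
        FROM t_with_id AS t
        WHERE EXISTS (
            SELECT 1
            FROM t_with_id AS dup
            WHERE (dup.a = t.a OR t.a IS NULL)
              AND (dup.b = t.b OR t.b IS NULL)
              AND (dup.c = t.c OR t.c IS NULL)
              AND (dup.d = t.d OR t.d IS NULL)
              AND dup.id <> t.id
        )
        ORDER BY t.id
        """
    )
]
print(flagged)
```

Both copies of the exact duplicate (ids 1 and 2) are flagged along with id 3, so a DELETE built on the same predicate would be expected to remove every copy rather than keep one of them.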

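Answer 1: (score: 1)

This does not handle true duplicates correctly (it catches both copies); for that I think we would still need ctid (or some cursor-like mechanism).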

WITH enums AS (
        SELECT x.a, x.b,x.c,x.d
        -- , (x.a IS NULL)::integer + (x.b IS NULL)::integer 
         -- + (x.c IS NULL)::integer + (x.d IS NULL)::integer AS nnull
        , row_number() OVER www AS rn
        FROM tbl x
        JOIN tbl y
        ON (x.a =y.a OR x.a IS NULL)
        AND (x.b =y.b OR x.b IS NULL)
        AND (x.c =y.c OR x.c IS NULL)
        AND (x.d =y.d OR x.d IS NULL)
        WINDOW WWW AS
        (PARTITION BY COALESCE(x.a ,y.a), COALESCE(x.b ,y.b)
                , COALESCE(x.c ,y.c), COALESCE(x.d ,y.d)
         ORDER BY x.a NULLS LAST
        , x.b NULLS LAST
        , x.c NULLS LAST
        , x.d NULLS LAST )
        )
SELECT* --DELETE
-- FROM  enums ex ; \q
FROM tbl del
WHERE EXISTS ( SELECT *
        FROM  enums ex
        WHERE ex.rn > 1
        AND ex.a IS NOT DISTINCT FROM del.a
        AND ex.b IS NOT DISTINCT FROM del.b
        AND ex.c IS NOT DISTINCT FROM del.c
        AND ex.d IS NOT DISTINCT FROM del.d
        );

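Answer 2: (score: 1)

FYI

Shortly after I asked the question, someone posted a really neat answer, but it looks like he/she deleted it later. The code in that answer didn't quite work, but I liked the approach. I modified it, and it seems to do the job: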

DELETE FROM t
 WHERE EXISTS (
   SELECT u.*
     FROM t AS u
    WHERE (t.a IS NULL OR t.a = u.a)
      AND (t.b IS NULL OR t.b = u.b)
      AND (t.c IS NULL OR t.c = u.c)
      AND (t.d IS NULL OR t.d = u.d)
      AND NOT (
        (
          t.a IS NULL AND u.a IS NULL
          OR (
            t.a IS NOT NULL AND u.a IS NOT NULL
            AND t.a = u.a
          )
        )
        AND (
          t.b IS NULL AND u.b IS NULL
          OR (
            t.b IS NOT NULL AND u.b IS NOT NULL
            AND t.b = u.b
          )
        )
        AND (
          t.c IS NULL AND u.c IS NULL
          OR (
            t.c IS NOT NULL AND u.c IS NOT NULL
            AND t.c = u.c
          )
        )
        AND (
          t.d IS NULL AND u.d IS NULL
          OR (
            t.d IS NOT NULL AND u.d IS NOT NULL
            AND t.d = u.d
          )
        )
      )
 );

A more concise version, which Redshift does not support yet:

DELETE FROM t
 WHERE EXISTS (
   SELECT u.*
     FROM t AS u
    WHERE (t.a IS NULL OR t.a = u.a)
      AND (t.b IS NULL OR t.b = u.b)
      AND (t.c IS NULL OR t.c = u.c)
      AND (t.d IS NULL OR t.d = u.d)
          EXCEPT
   SELECT t.*
 );
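
As a rough check of this approach on the sample data, the sketch below runs an equivalent DELETE through Python's built-in `sqlite3` module. That is an assumption made for convenience: the engine is SQLite, not PostgreSQL or Redshift. The long `NOT (...)` block in the first statement is a NULL-safe equality test written out by hand. SQLite spells that test `IS`, and the PostgreSQL counterpart is `IS NOT DISTINCT FROM`, as used in Answer 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT, b INT, c INT, d INT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (3, 5, 12, None),
        (None, 5, 12, None),
        (13, None, 26, None),
        (None, None, 26, 4),
        (6, 7, 5, None),
        (6, None, None, None),
        (6, None, 5, None),
        (6, 7, None, None),
        (None, None, None, None),
    ],
)

conn.execute(
    """
    DELETE FROM t
     WHERE EXISTS (
       SELECT 1
         FROM t AS u
        WHERE (t.a IS NULL OR t.a = u.a)
          AND (t.b IS NULL OR t.b = u.b)
          AND (t.c IS NULL OR t.c = u.c)
          AND (t.d IS NULL OR t.d = u.d)
          -- SQLite-specific: IS compares NULL-safely, so this excludes
          -- the row itself and any exact copies of it.
          AND NOT (t.a IS u.a AND t.b IS u.b AND t.c IS u.c AND t.d IS u.d)
     )
    """
)

remaining = conn.execute("SELECT a, b, c, d FROM t ORDER BY rowid").fetchall()
for row in remaining:
    print(row)
```

Under these assumptions the four rows left in `t` are the ones in the desired table from the question.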

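Answer 3: (score: 0)

This finds the unique rows with wildcards (NULLs), but it may not work if a match involves more than one NULL.

First get the rows that contain no NULLs, then join them back to the base table and fill in the NULLs. Finally, SELECT DISTINCT produces the desired result.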

WITH nn AS (
    SELECT * 
    FROM t 
    WHERE a IS NOT NULL 
      AND b IS NOT NULL 
      AND c IS NOT NULL 
      AND d IS NOT NULL
)
SELECT DISTINCT 
      COALESCE(t1.a, nn.a) a
    , COALESCE(t1.b, nn.b) b
    , COALESCE(t1.c, nn.c) c
    , COALESCE(t1.d, nn.d) d
    FROM t t1 
    LEFT JOIN nn
           ON (t1.a IS NULL OR t1.a = nn.a)
          AND (t1.b IS NULL OR t1.b = nn.b)
          AND (t1.c IS NULL OR t1.c = nn.c)
          AND (t1.d IS NULL OR t1.d = nn.d)

For example, this will not work if two additional rows are inserted:

insert into t values (2, 10, 12, null), (2, null, 12, 10);

because the CTE of not-nulls (nn) omits both of them.
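
The limitation can be observed with a small sketch that runs the query above on the question's sample rows plus the two extra rows. It assumes Python's built-in `sqlite3` module in place of PostgreSQL or Redshift. The variable names `nn_count` and `result` are only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT, b INT, c INT, d INT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (3, 5, 12, None),
        (None, 5, 12, None),
        (13, None, 26, None),
        (None, None, 26, 4),
        (6, 7, 5, None),
        (6, None, None, None),
        (6, None, 5, None),
        (6, 7, None, None),
        (None, None, None, None),
        # The two additional rows from the example above.
        (2, 10, 12, None),
        (2, None, 12, 10),
    ],
)

# How many rows would the nn CTE contain?
nn_count = conn.execute(
    """
    SELECT COUNT(*) FROM t
    WHERE a IS NOT NULL AND b IS NOT NULL
      AND c IS NOT NULL AND d IS NOT NULL
    """
).fetchone()[0]

result = conn.execute(
    """
    WITH nn AS (
        SELECT * FROM t
        WHERE a IS NOT NULL AND b IS NOT NULL
          AND c IS NOT NULL AND d IS NOT NULL
    )
    SELECT DISTINCT
          COALESCE(t1.a, nn.a) a
        , COALESCE(t1.b, nn.b) b
        , COALESCE(t1.c, nn.c) c
        , COALESCE(t1.d, nn.d) d
    FROM t t1
    LEFT JOIN nn
           ON (t1.a IS NULL OR t1.a = nn.a)
          AND (t1.b IS NULL OR t1.b = nn.b)
          AND (t1.c IS NULL OR t1.c = nn.c)
          AND (t1.d IS NULL OR t1.d = nn.d)
    """
).fetchall()

print(nn_count, len(result))
```

In this data set every row has at least one NULL, so `nn` is empty, the LEFT JOIN has nothing to fill in, and the query returns all eleven input rows unchanged.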