Consider the following table, in which none of the columns has a NOT NULL constraint:
a | b | c | d
------+------+------+------
3 | 5 | 12 | NULL
NULL | 5 | 12 | NULL
13 | NULL | 26 | NULL
NULL | NULL | 26 | 4
6 | 7 | 5 | NULL
6 | NULL | NULL | NULL
6 | NULL | 5 | NULL
6 | 7 | NULL | NULL
NULL | NULL | NULL | NULL
(9 rows)
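All nine rows are distinct, but if we treat NULL as a "wildcard", meaning it can take any value, then only the first, third, fourth and fifth rows are definitely distinct. Because the non-NULL values of the other rows all appear in one of the definitely distinct rows, I want to delete those other rows to produce the following table: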
a | b | c | d
------+------+----+------
3 | 5 | 12 | NULL
13 | NULL | 26 | NULL
NULL | NULL | 26 | 4
6 | 7 | 5 | NULL
(4 rows)
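Given a table, how can I delete the rows whose values are a subset of the values of another row in the same table? Another way to ask this: how can I deduplicate a table using NULL as a wildcard?
Don't worry about deleting actual duplicate rows (that is why I put "deduplication" in quotes in the title). In particular, I would like to be able to do this in both PostgreSQL and Redshift.
For reference, these statements create the original table above: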
CREATE TABLE t (a int, b int, c int, d int);
INSERT INTO t
VALUES ( 3, 5, 12, NULL),
(NULL, 5, 12, NULL),
( 13, NULL, 26, NULL),
(NULL, NULL, 26, 4),
( 6, 7, 5, NULL),
( 6, NULL, NULL, NULL),
( 6, NULL, 5, NULL),
( 6, 7, NULL, NULL),
(NULL, NULL, NULL, NULL);
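The rule being asked for can also be stated outside SQL. Below is a minimal Python sketch, not part of the original question, that uses None for NULL; the names `is_subsumed` and `dedupe` are made up for illustration. A row is dropped when some other, non-identical row agrees with it on every non-NULL column.

```python
# Sample rows from the question; None stands in for SQL NULL.
ROWS = [
    (3, 5, 12, None),
    (None, 5, 12, None),
    (13, None, 26, None),
    (None, None, 26, 4),
    (6, 7, 5, None),
    (6, None, None, None),
    (6, None, 5, None),
    (6, 7, None, None),
    (None, None, None, None),
]


def is_subsumed(row, other):
    """True if every non-NULL value in `row` matches `other`
    and the two rows are not identical."""
    if row == other:
        return False
    return all(v is None or v == w for v, w in zip(row, other))


def dedupe(rows):
    """Keep only the rows that no other row subsumes."""
    return [r for r in rows if not any(is_subsumed(r, o) for o in rows)]


survivors = dedupe(ROWS)
for r in survivors:
    print(r)
```

On the sample data this keeps the first, third, fourth and fifth rows. Exact duplicates are left alone, in line with the remark that true duplicates are out of scope.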
Answer 0 (score: 1)
To select only the rows that have no match under the NULL wildcard, use NOT EXISTS:
SELECT *
FROM T AS t
WHERE NOT EXISTS (
SELECT 1
FROM T AS dup
WHERE (dup.a = t.a OR t.a IS NULL)
AND (dup.b = t.b OR t.b IS NULL)
AND (dup.c = t.c OR t.c IS NULL)
AND (dup.d = t.d OR t.d IS NULL)
AND CONCAT(dup.a,'-',dup.b,'-',dup.c,'-',dup.d) <> CONCAT(t.a,'-',t.b,'-',t.c,'-',t.d)
)
To select only the duplicates under the NULL wildcard, use EXISTS:
SELECT *
FROM T AS t
WHERE EXISTS (
SELECT 1
FROM T AS dup
WHERE (dup.a = t.a OR t.a IS NULL)
AND (dup.b = t.b OR t.b IS NULL)
AND (dup.c = t.c OR t.c IS NULL)
AND (dup.d = t.d OR t.d IS NULL)
AND CONCAT(dup.a,'-',dup.b,'-',dup.c,'-',dup.d) <> CONCAT(t.a,'-',t.b,'-',t.c,'-',t.d)
)
To delete the duplicates from the table under the NULL wildcard, use EXISTS:
DELETE
FROM T AS t
WHERE EXISTS (
SELECT 1
FROM T AS dup
WHERE (dup.a = t.a OR t.a IS NULL)
AND (dup.b = t.b OR t.b IS NULL)
AND (dup.c = t.c OR t.c IS NULL)
AND (dup.d = t.d OR t.d IS NULL)
AND CONCAT(dup.a,'-',dup.b,'-',dup.c,'-',dup.d) <> CONCAT(t.a,'-',t.b,'-',t.c,'-',t.d)
)
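Note that because of the comparison on CONCAT, rows that are exact duplicates of each other are not treated as duplicates.
If the table has an ID as its primary key, the CONCAT comparison can be replaced with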
AND dup.ID <> t.ID
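but then rows that are exact duplicates of each other are also treated as duplicates. Each copy matches the other, so the DELETE would remove both copies.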
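To see how the two inequality tests differ, here is a small sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption; the three-row table, its `id` column and the aliases `cur` and `dup` are hypothetical). SQLite's null-safe `IS NOT` takes the place of the CONCAT comparison, and the query only selects the rows each variant would flag instead of deleting them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a int, b int, c int, d int)")
conn.executemany(
    "INSERT INTO t (id, a, b, c, d) VALUES (?, ?, ?, ?, ?)",
    [
        (1, 6, 7, 5, None),
        (2, 6, 7, 5, None),        # exact duplicate of row 1
        (3, 6, None, None, None),  # subset of rows 1 and 2
    ],
)

# Same wildcard match as in the answer; only the final "is a different row"
# test is swapped in through {differs}.
MATCH = """
    SELECT cur.id
    FROM t AS cur
    WHERE EXISTS (
        SELECT 1
        FROM t AS dup
        WHERE (dup.a = cur.a OR cur.a IS NULL)
          AND (dup.b = cur.b OR cur.b IS NULL)
          AND (dup.c = cur.c OR cur.c IS NULL)
          AND (dup.d = cur.d OR cur.d IS NULL)
          AND {differs}
    )
    ORDER BY cur.id
"""

BY_ID = "dup.id <> cur.id"
BY_VALUE = (
    "(dup.a IS NOT cur.a OR dup.b IS NOT cur.b"
    " OR dup.c IS NOT cur.c OR dup.d IS NOT cur.d)"
)

by_id = [row[0] for row in conn.execute(MATCH.format(differs=BY_ID))]
by_value = [row[0] for row in conn.execute(MATCH.format(differs=BY_VALUE))]

print("flagged when comparing ids:   ", by_id)
print("flagged when comparing values:", by_value)
```

With the ID test all three rows are flagged, so a DELETE would empty this table. With the value test only row 3 is flagged and both exact duplicates survive.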
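Answer 1 (score: 1)
This will not handle true duplicates correctly (it catches both copies); for that, I think we would still need ctid (or some cursor-based approach).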
WITH enums AS (
SELECT x.a, x.b,x.c,x.d
-- , (x.a IS NULL)::integer + (x.b IS NULL)::integer
-- + (x.c IS NULL)::integer + (x.d IS NULL)::integer AS nnull
, row_number() OVER www AS rn
FROM tbl x
JOIN tbl y
ON (x.a =y.a OR x.a IS NULL)
AND (x.b =y.b OR x.b IS NULL)
AND (x.c =y.c OR x.c IS NULL)
AND (x.d =y.d OR x.d IS NULL)
WINDOW WWW AS
(PARTITION BY COALESCE(x.a ,y.a), COALESCE(x.b ,y.b)
, COALESCE(x.c ,y.c), COALESCE(x.d ,y.d)
ORDER BY x.a NULLS LAST
, x.b NULLS LAST
, x.c NULLS LAST
, x.d NULLS LAST )
)
SELECT * -- replace SELECT * with DELETE to remove the rows
FROM tbl del
WHERE EXISTS ( SELECT *
FROM enums ex
WHERE ex.rn > 1
AND ex.a IS NOT DISTINCT FROM del.a
AND ex.b IS NOT DISTINCT FROM del.b
AND ex.c IS NOT DISTINCT FROM del.c
AND ex.d IS NOT DISTINCT FROM del.d
);
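Answer 2 (score: 1)
FYI
Shortly after I asked the question, someone posted a very neat answer, but it looks like he or she deleted it later. The code in that answer did not quite work, but I liked the approach. I modified it, and it seems to do the job: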
DELETE FROM t
WHERE EXISTS (
SELECT u.*
FROM t AS u
WHERE (t.a IS NULL OR t.a = u.a)
AND (t.b IS NULL OR t.b = u.b)
AND (t.c IS NULL OR t.c = u.c)
AND (t.d IS NULL OR t.d = u.d)
AND NOT (
(
t.a IS NULL AND u.a IS NULL
OR (
t.a IS NOT NULL AND u.a IS NOT NULL
AND t.a = u.a
)
)
AND (
t.b IS NULL AND u.b IS NULL
OR (
t.b IS NOT NULL AND u.b IS NOT NULL
AND t.b = u.b
)
)
AND (
t.c IS NULL AND u.c IS NULL
OR (
t.c IS NOT NULL AND u.c IS NOT NULL
AND t.c = u.c
)
)
AND (
t.d IS NULL AND u.d IS NULL
OR (
t.d IS NOT NULL AND u.d IS NOT NULL
AND t.d = u.d
)
)
)
);
A more concise version, which Redshift does not support yet:
DELETE FROM t
WHERE EXISTS (
SELECT u.*
FROM t AS u
WHERE (t.a IS NULL OR t.a = u.a)
AND (t.b IS NULL OR t.b = u.b)
AND (t.c IS NULL OR t.c = u.c)
AND (t.d IS NULL OR t.d = u.d)
EXCEPT
SELECT t.*
);
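The long NOT (...) block in the first version is a per-column null-safe equality test, which PostgreSQL can also spell IS NOT DISTINCT FROM (as Answer 1 does). The sketch below checks that equivalence for one column using Python's built-in sqlite3 module, where the null-safe operator is written `IS`. SQLite is only a stand-in here (an assumption), and whether Redshift accepts the shorter spelling is not checked.

```python
import sqlite3
from itertools import product

conn = sqlite3.connect(":memory:")

# One column's worth of the verbose comparison from the answer.
VERBOSE = """
    SELECT (:x IS NULL AND :y IS NULL)
        OR (:x IS NOT NULL AND :y IS NOT NULL AND :x = :y)
"""
# SQLite's null-safe equality operator.
NULL_SAFE = "SELECT :x IS :y"

results = {}
for x, y in product([None, 1, 2], repeat=2):
    params = {"x": x, "y": y}
    verbose = conn.execute(VERBOSE, params).fetchone()[0]
    null_safe = conn.execute(NULL_SAFE, params).fetchone()[0]
    results[(x, y)] = (verbose, null_safe)

for pair, outcome in results.items():
    print(pair, outcome)
```

Both expressions agree on all nine combinations: they are true only when both sides are NULL or both hold the same value.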
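Answer 3 (score: 0)
This finds the unique rows with wildcards (NULLs), but it may not work when there is more than one NULL per match.
First get the rows that contain no NULLs, then join them back to the base table and fill in the NULLs. Finally, SELECT DISTINCT produces the desired result.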
WITH nn AS (
SELECT *
FROM t
WHERE a IS NOT NULL
AND b IS NOT NULL
AND c IS NOT NULL
AND d IS NOT NULL
)
SELECT DISTINCT
COALESCE(t1.a, nn.a) a
, COALESCE(t1.b, nn.b) b
, COALESCE(t1.c, nn.c) c
, COALESCE(t1.d, nn.d) d
FROM t t1
LEFT JOIN nn
ON (t1.a IS NULL OR t1.a = nn.a)
AND (t1.b IS NULL OR t1.b = nn.b)
AND (t1.c IS NULL OR t1.c = nn.c)
AND (t1.d IS NULL OR t1.d = nn.d)
For example, this will not work if two additional rows are inserted:
insert into t values (2, 10, 12, null), (2, null, 12, 10);
because the CTE of rows without NULLs (nn) omits both of them.
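The same limitation appears to affect the question's own sample table, where every row contains at least one NULL. A quick check with Python's built-in sqlite3 module, used as a stand-in for PostgreSQL (an assumption), runs the query above unchanged apart from added AS keywords:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a int, b int, c int, d int)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (3, 5, 12, None),
        (None, 5, 12, None),
        (13, None, 26, None),
        (None, None, 26, 4),
        (6, 7, 5, None),
        (6, None, None, None),
        (6, None, 5, None),
        (6, 7, None, None),
        (None, None, None, None),
    ],
)

# Number of rows that would end up in the nn CTE.
nn_count = conn.execute(
    """
    SELECT count(*)
    FROM t
    WHERE a IS NOT NULL AND b IS NOT NULL
      AND c IS NOT NULL AND d IS NOT NULL
    """
).fetchone()[0]

QUERY = """
    WITH nn AS (
        SELECT *
        FROM t
        WHERE a IS NOT NULL AND b IS NOT NULL
          AND c IS NOT NULL AND d IS NOT NULL
    )
    SELECT DISTINCT
           COALESCE(t1.a, nn.a) AS a
         , COALESCE(t1.b, nn.b) AS b
         , COALESCE(t1.c, nn.c) AS c
         , COALESCE(t1.d, nn.d) AS d
    FROM t AS t1
    LEFT JOIN nn
      ON (t1.a IS NULL OR t1.a = nn.a)
     AND (t1.b IS NULL OR t1.b = nn.b)
     AND (t1.c IS NULL OR t1.c = nn.c)
     AND (t1.d IS NULL OR t1.d = nn.d)
"""
result = conn.execute(QUERY).fetchall()

print("rows in nn:", nn_count)
print("rows returned:", len(result))
```

Here nn is empty, so the LEFT JOIN has nothing to fill in and all nine input rows come back. The approach only helps when the table contains fully populated rows.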