I'm used to deduplicating tables on all kinds of databases with a single command, usually something like this:
DELETE
FROM
table AS original
USING
(
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY col1, col2, col3, ...
) AS rn
FROM
table
) AS other
WHERE
rn > 1 AND
original.col1 = other.col1 AND
original.col2 = other.col2 AND
original.col3 = other.col3 AND ...
;
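This deletes only the duplicates and leaves the first occurrence of each row behind, which is what I expect.
I tried to replicate that code on BigQuery, and the only way I've come close to making it work is with MERGE, using a statement like this: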
MERGE
`table` orig
USING
(
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY col1, col2, col3, ...
) AS rn
FROM
table
) AS other
ON
(orig.col1 = other.col1 OR (orig.col1 IS NULL AND other.col1 IS NULL)) AND
(orig.col2 = other.col2 OR (orig.col2 IS NULL AND other.col2 IS NULL)) AND
(orig.col3 = other.col3 OR (orig.col3 IS NULL AND other.col3 IS NULL)) AND ...
WHEN MATCHED AND other.rn > 1 THEN
DELETE
;
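This sort of works: it deletes all the duplicated rows, including the first occurrence. I think this is how BigQuery handles DELETE inside a MERGE: any row that matches on those fields gets deleted. But I need to keep the first occurrence. Any ideas?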
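To make the symptom concrete, here is a minimal reproduction sketch of what I think is happening. It is not BigQuery itself: it assumes Python's built-in sqlite3 module with SQLite 3.25 or newer (for window functions), and the table `t` and its sample rows are hypothetical. It counts how many rows of the original table match some `rn > 1` row when the join is on the column values only.

```python
import sqlite3

# Hypothetical sample data: one row repeated three times, one unique row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", "x"), ("a", "x"), ("a", "x"), ("b", "y")],
)

# Count the rows of t that match at least one "rn > 1" row on the key columns.
# These are the rows a key-only match would mark for deletion.
matched = conn.execute(
    """
    SELECT COUNT(*)
    FROM t AS orig
    WHERE EXISTS (
        SELECT 1
        FROM (
            SELECT col1, col2,
                   ROW_NUMBER() OVER (PARTITION BY col1, col2) AS rn
            FROM t
        ) AS other
        WHERE other.rn > 1
          AND orig.col1 = other.col1
          AND orig.col2 = other.col2
    )
    """
).fetchone()[0]

total = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(matched, "of", total, "rows match a duplicate")
```

In this sketch all three copies of the repeated row match, not just the two extra ones, because the original table has no column that tells the first copy apart from the others. That is consistent with what I see from the MERGE.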