I have a BQ table partitioned by insertion time, and I'm trying to remove duplicates from it. These are true duplicates: for any two duplicated rows, every column is equal. Of course a unique key would have helped here :-(
At first I tried to enumerate the duplicates with a SELECT query and drop them:
SELECT
  * EXCEPT(row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id_column) row_number
  FROM
    `mytable`)
WHERE
  row_number = 1
This does yield unique rows, but it creates a new table that no longer carries the partitioning, which is no good.
I've seen this answer here, which explains that the only way to keep the partitions is to iterate over them one by one with the query above and save each result into the corresponding destination table partition.
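For concreteness, here is a sketch of that per-partition approach, assuming an ingestion-time partitioned table (the partition date and id_column are placeholders): the dedup query above is run once per partition, restricted through the _PARTITIONTIME pseudo-column, and the result is written back over that same partition, for example with the bq CLI's --destination_table='mytable$20190917' --replace flags.
-- Sketch: deduplicate a single ingestion-time partition; run once per partition,
-- writing the output back to that same partition (e.g. via
-- bq query --use_legacy_sql=false --replace --destination_table='dataset.mytable$20190917').
SELECT
  * EXCEPT(row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id_column) row_number
  FROM
    `mytable`
  WHERE
    DATE(_PARTITIONTIME) = DATE '2019-09-17')  -- restrict to one partition
WHERE
  row_number = 1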
What I'd really like to do is delete the offending duplicate rows with a DML DELETE. I tried something along the lines of what this answer suggested:
DELETE
FROM `mytable` AS d
WHERE (SELECT ROW_NUMBER() OVER (PARTITION BY id_column)
       FROM `mytable` AS d2
       WHERE d.id = d2.id) > 1;
But the accepted answer's approach doesn't work here and results in a BQ error:
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN
It would be great if someone could offer a simpler way to deal with this (DML or otherwise), so that I don't have to iterate over all the partitions individually.
Answer 0: (score: 5)
It's a bit of a hack, but you can use a MERGE statement that deletes all of the table's contents and atomically re-inserts only the distinct rows. Here's an example:
-- Create a table with some duplicate rows
CREATE TABLE dataset.PartitionedTable
PARTITION BY date AS
SELECT x, CONCAT('foo', CAST(x AS STRING)) AS y, DATE_SUB(CURRENT_DATE(), INTERVAL x DAY) AS date
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x, UNNEST(GENERATE_ARRAY(1, 10));
Now for the MERGE part:
-- Execute a MERGE statement where all original rows are deleted,
-- then replaced with new, deduplicated rows:
MERGE dataset.PartitionedTable AS t1
USING (SELECT DISTINCT * FROM dataset.PartitionedTable) AS t2
ON FALSE
WHEN NOT MATCHED BY TARGET THEN INSERT (x, y, date) VALUES (t2.x, t2.y, t2.date)
WHEN NOT MATCHED BY SOURCE THEN DELETE
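A quick follow-up check (not part of the original answer; it just reuses the columns of the sample table above) is to confirm that no fully-identical row appears more than once after the MERGE:
-- Should return zero rows once the MERGE above has run
SELECT x, y, date, COUNT(*) AS cnt
FROM dataset.PartitionedTable
GROUP BY x, y, date
HAVING COUNT(*) > 1;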
Answer 1: (score: 0)
You can do this with a single SQL MERGE statement, without creating an extra table.
-- WARNING: back up the table before this operation
-- For a large timestamp-partitioned table
-- -------------------------------------------
-- De-duplicates the rows within a given range of a partitioned table, using surrogate_key as the unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
  SELECT k.*
  FROM (
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
    FROM `gcp_project`.`data_set`.`the_table` AS original_data
    WHERE stamp BETWEEN dt_start AND dt_end
    GROUP BY surrogate_key
  )
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
  THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
Credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
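As a hedged sanity check (it simply reuses the column names and timestamp range from the statement above), the following should return no rows once the MERGE has deduplicated the range:
-- Each surrogate_key should now appear at most once within the deduplicated range
SELECT surrogate_key, COUNT(*) AS cnt
FROM `gcp_project`.`data_set`.`the_table`
WHERE stamp BETWEEN TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles")
            AND TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles")
GROUP BY surrogate_key
HAVING COUNT(*) > 1;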