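I have a table in BigQuery with 5 columns, none of which is a unique id column. I want to check whether this table contains any duplicate rows. Currently I do this with the query below: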
SELECT conc, COUNT(*) AS total
FROM (
  SELECT CONCAT(CAST(col1 AS STRING),
                CAST(col2 AS STRING),
                CAST(col3 AS STRING),
                CAST(col4 AS STRING),
                CAST(col5 AS STRING)) AS conc
  FROM <table>
)
GROUP BY conc
HAVING total > 1
Is there a simpler way to do this? I actually want to run this check on tables with dozens of columns.
Answer 0 (score: 2)
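For this case ("I want to check whether this table has any duplicate rows"), you can serialize each whole row with TO_JSON_STRING and group on the result, so no columns need to be listed by hand: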
#standardSQL
SELECT TO_JSON_STRING(t) AS row, COUNT(1) AS total
FROM `project.dataset.your_table` t
GROUP BY row
HAVING total > 1
Update
I think using hash functions can improve performance. For example:
#standardSQL
SELECT
MD5(TO_JSON_STRING(t)) AS id,
ANY_VALUE(TO_JSON_STRING(t)) AS row,
COUNT(1) AS total
FROM `project.dataset.your_table` t
GROUP BY id
HAVING total > 1
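If only a yes/no answer is needed rather than the list of duplicated rows, one possible variation is to compare the total row count with the number of distinct serialized rows. This is a minimal sketch, not part of the original answer: it reuses the placeholder table name from above, and the output column names (total_rows, distinct_rows, has_duplicates) are chosen here purely for illustration.

```sql
#standardSQL
SELECT
  COUNT(*) AS total_rows,                                    -- all rows in the table
  COUNT(DISTINCT TO_JSON_STRING(t)) AS distinct_rows,        -- rows that differ in at least one column
  COUNT(*) > COUNT(DISTINCT TO_JSON_STRING(t)) AS has_duplicates  -- true if any row repeats
FROM `project.dataset.your_table` t
```

The query returns a single row. If has_duplicates is true, the grouped queries above can then be used to see which rows repeat and how often.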