How to generate a unique ID from non-unique records in Google BigQuery

Time: 2018-05-04 07:30:01

Tags: google-bigquery

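I have a table in BigQuery with 5 columns, none of which is a unique ID column. I want to check whether this table contains any duplicate rows. Currently I do this with the query below: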

SELECT conc, COUNT(*) AS total FROM (SELECT CONCAT(CAST(col1 AS STRING), CAST(col2 AS STRING), CAST(col3 AS STRING), CAST(col4 AS STRING), CAST(col5 AS STRING)) AS conc FROM <table>) GROUP BY conc HAVING total > 1

Is there a simpler way? I actually want to do this for tables with dozens of columns.

1 answer:

Answer 0 (score: 2)

  

I want to check whether this table contains any duplicate rows

In that case, TO_JSON_STRING() is useful: it serializes the whole row, so you do not have to list and cast every column.

#standardSQL
SELECT TO_JSON_STRING(t) AS row, COUNT(1) AS total
FROM `project.dataset.your_table` t
GROUP BY row
HAVING total > 1
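
To try this without touching a real table, here is a minimal sketch that runs the same query over inline data. The `sample` CTE and its values are made up for illustration. It also hints at why serializing the row may be safer than a bare CONCAT: the rows ('ab', 'c') and ('a', 'bc') concatenate to the same string but serialize to different JSON, so only the genuine duplicate should be reported.

```sql
#standardSQL
-- Hypothetical inline data standing in for `project.dataset.your_table`.
WITH sample AS (
  SELECT 'ab' AS col1, 'c' AS col2 UNION ALL
  SELECT 'a', 'bc' UNION ALL   -- same CONCAT result as the row above, but a different row
  SELECT 'x', 'y' UNION ALL
  SELECT 'x', 'y'              -- a genuine duplicate
)
SELECT TO_JSON_STRING(t) AS row, COUNT(1) AS total
FROM sample t
GROUP BY row
HAVING total > 1
```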
  

Update

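I think using hash functions can improve on this, since grouping on a fixed-size hash avoids grouping on the full JSON string. For example: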

#standardSQL
SELECT 
  MD5(TO_JSON_STRING(t)) AS id, 
  ANY_VALUE(TO_JSON_STRING(t)) AS row, 
  COUNT(1) AS total
FROM `project.dataset.your_table` t
GROUP BY id
HAVING total > 1
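
If the goal is the one in the title, an ID for every distinct row rather than only the duplicated ones, the same hash can probably be reused by dropping the HAVING clause. The following is a sketch under two assumptions: a hex string is an acceptable ID format (MD5 returns BYTES, so TO_HEX is used to make it readable), and the very small theoretical chance of an MD5 collision is tolerable for your data. The table name is the same placeholder as above.

```sql
#standardSQL
SELECT
  TO_HEX(MD5(TO_JSON_STRING(t))) AS id,   -- identical rows get identical ids
  ANY_VALUE(TO_JSON_STRING(t)) AS row,    -- one representative copy of the row
  COUNT(1) AS total                       -- how many copies of that row exist
FROM `project.dataset.your_table` t
GROUP BY id
```

Rows with total = 1 are already unique; rows with total > 1 are the duplicates the original query reports.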