BigQuery - DELETE语句删除重复项

时间:2018-01-06 18:08:50

标签: google-bigquery

SQL上有很多很棒的帖子可以选择唯一的行并写入(截断)表格,以便删除dus。 e.g

WITH ev AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
  FROM `duplicates`
 )
SELECT
  * EXCEPT(rowNum)
FROM
  ev
WHERE rowNum = 1

我试图使用DML和DELETE稍微改变一下(例如,如果你不想使用BQ savedQuery,只需执行SQL)。我想做的大致是:

WITH dup_events AS (
  SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
      FROM `duplicates`
 )
DELETE FROM
  dup_events
WHERE rowNum > 1

但在控制台中出现此错误:

Syntax error: Expected "(" or keyword SELECT but got keyword DELETE at [10:1]

可以使用DELETE实现(standardSQL)吗?

谢谢!

4 个答案:

答案 0 :(得分:9)

实际上:o)工作

#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
        SELECT AS STRUCT id, MAX(loadTime) loadTime 
        FROM `yourproject.yourdataset.duplicates` 
        GROUP BY id)  

注意:它假设loadTime也是唯一的 - 这意味着如果给定的id有多个带有最新loadTime的记录 - 它们都将被保留

答案 1 :(得分:3)

syntax documentation开始,DELETE的参数需要是一个表,并且没有使用WITH子句的规定。这是有道理的,因为您无法从基本上是逻辑视图(CTE)中删除。您可以通过将逻辑放在过滤器中来表达您想要的内容,例如

DELETE
FROM duplicates AS d
WHERE (SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC)
       FROM `duplicates` AS d2
       WHERE d.id = d2.id AND d.loadTime = d2.loadTime) > 1;

答案 2 :(得分:2)

这必须是最简单的方法:

create or replace table `myproject.mydataset.duplicates` as (
select distinct *
from `myproject.mydataset.duplicates`)

如果您有数组数据类型,请尝试以下操作:

-- build a test table with a duplicate and an array datatype column --
create or replace table DW.pmoTest as (
select 1 as ID, 'peter' as firstname,ARRAY<INT64>[1, 2, 3]  as int_array, current_date as createdate
union all
select 1 as ID, 'peter' as firstname,ARRAY<INT64>[1, 7, 3] as int_array, current_date as createdate
union all
select 2 as ID, 'chamri' as firstname,ARRAY<INT64>[1, 2, 39, 4] as int_array, current_date as createdate
);

-- recreate table without duplicate row
create or replace table DW.pmoTest as (
SELECT col.* FROM (
  SELECT ARRAY_AGG(tbl ORDER BY createdate LIMIT 1)[OFFSET(0)]  col
  FROM DW.pmoTest tbl
  GROUP BY ID
  )
);

答案 3 :(得分:0)

以上答案仅适用于小尺寸桌子。如果分区表很大,并且只想删除给定范围内的重复项,请使用以下SQL:

-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table 
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partition table, using surrage_key as unique id
-- -------------------------------------------

DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");

MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
  SELECT k.*
  FROM (
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k 
    FROM `gcp_project`.`data_set`.`the_table` AS original_data
    WHERE stamp BETWEEN dt_start AND dt_end
    GROUP BY surrogate_key
  )

) AS INTERNAL_SOURCE
ON FALSE

WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partiion range
    THEN DELETE

WHEN NOT MATCHED THEN INSERT ROW

信用:https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a

相关问题