I am trying to delete some duplicate data in my Redshift table.
Below is my query:
With duplicates
As
(Select *, ROW_NUMBER() Over (PARTITION by record_indicator Order by record_indicator) as Duplicate From table_name)
delete from duplicates
Where Duplicate > 1 ;
This query gives me an error:
Amazon Invalid operation: syntax error at or near "delete";
Not sure what the issue is, as the syntax of the WITH clause seems to be correct. Has anyone faced this situation before?
Answer 0 (score: 19)
Redshift being what it is (no enforced uniqueness for any column), Ziggy's third option is probably best. Once we decide to go the temp table route, it is more efficient to swap things out whole; deletes and inserts are expensive in Redshift.
begin;
create table table_name_new as select distinct * from table_name;
alter table table_name rename to table_name_old;
alter table table_name_new rename to table_name;
drop table table_name_old;
commit;
If space isn't an issue, you can keep the old table around for a while and use the other methods described here to verify that the row count in the original, accounting for the duplicates, matches the row count in the new table.
If you are doing constant loads to such a table, you will want to pause that process while this is going on.
If the number of duplicates is only a small percentage of a large table, you might want to try copying distinct records of the duplicates to a temp table, then delete all records from the original that join with the temp table. Then append the temp table back to the original. Make sure you vacuum the original table afterward (which you should be doing on a schedule for large tables anyway).
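The steps above might be sketched as follows (a minimal sketch, not the answerer's exact code; it assumes duplicates are fully identical rows grouped by record_indicator, and that VACUUM runs outside the transaction, as Redshift requires):

```sql
begin;
-- Copy one distinct copy of each duplicated record into a temp table
create temp table dup_rows as
select distinct *
from table_name
where record_indicator in (
    select record_indicator
    from table_name
    group by record_indicator
    having count(*) > 1
);
-- Delete every copy of the duplicated records from the original
delete from table_name
using dup_rows
where table_name.record_indicator = dup_rows.record_indicator;
-- Append the single copies back
insert into table_name
select * from dup_rows;
commit;
-- Reclaim space and re-sort after the churn
vacuum table_name;
```

The point of this variant is that only the duplicated slice of the table is rewritten, instead of the whole table.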
Answer 1 (score: 10)
If you are dealing with huge amounts of data, it is not always feasible or smart to rebuild the whole table. It may be easier to locate and delete those rows:
-- First identify all the rows that are duplicate
CREATE TEMP TABLE duplicate_saleids AS
SELECT saleid
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
GROUP BY saleid
HAVING COUNT(*) > 1;
-- Extract one copy of all the duplicate rows
CREATE TEMP TABLE new_sales(LIKE sales);
INSERT INTO new_sales
SELECT DISTINCT *
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Remove all rows that were duplicated (all copies).
DELETE FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Insert back in the single copies
INSERT INTO sales
SELECT *
FROM new_sales;
-- Cleanup
DROP TABLE duplicate_saleids;
DROP TABLE new_sales;
COMMIT;
Full article: https://elliot.land/post/removing-duplicate-data-in-redshift
Answer 2 (score: 5)
That should work. Alternatively you can do:
With
duplicates As (
Select *, ROW_NUMBER() Over (PARTITION by record_indicator
Order by record_indicator) as Duplicate
From table_name)
delete from table_name
where id in (select id from duplicates Where Duplicate > 1);
Or
delete from table_name
where id in (
select id
from (
Select id, ROW_NUMBER() Over (PARTITION by record_indicator
Order by record_indicator) as Duplicate
From table_name) x
Where Duplicate > 1);
If you have no primary key, you can do the following:
BEGIN;
CREATE TEMP TABLE mydups ON COMMIT DROP AS
SELECT DISTINCT ON (record_indicator) *
FROM table_name
ORDER BY record_indicator --, other_optional_priority_field DESC
;
DELETE FROM table_name
WHERE record_indicator IN (
SELECT record_indicator FROM mydups);
INSERT INTO table_name SELECT * FROM mydups;
COMMIT;
Answer 3 (score: 3)
The following deletes all records in 'tablename' that have duplicates; it does not deduplicate the table:
DELETE FROM tablename
WHERE id IN (
SELECT id
FROM (
SELECT id,
ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename
) t
WHERE t.rnum > 1);
Answer 4 (score: 3)
Simple answer to this question: first create a temporary table from the main table containing only the rows where row_number = 1, then delete all the rows from the main table on which we had duplicates, and finally insert the rows from the temporary table back into the main table. Queries:

Create the temporary table:
select id, date into #temp_a
from
  (select *
   from (select a.*,
                row_number() over (partition by id order by etl_createdon desc) as rn
         from table a
         where a.id between 59 and 75 and a.date = '2018-05-24') b
   where rn = 1) a
Delete all the rows from the main table:
delete from table a
where a.id between 59 and 75 and a.date = '2018-05-24'
Insert all the values from the temporary table back into the main table:
insert into table a select * from #temp_a
Answer 5 (score: 1)
Your query does not work because Redshift does not allow DELETE after the WITH clause. Only SELECT and UPDATE and a few others are allowed (see WITH clause).
Solution (in my case):
My table events has an id column that contains the duplicate rows and uniquely identifies a record. This column id is the same as your record_indicator.
Unfortunately I was not able to create a temporary table, because using SELECT DISTINCT I ran into the following error:
ERROR: Intermediate result row exceeds database block size
But this worked like a charm:
CREATE TABLE temp as (
SELECT *,ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rownumber
FROM events
);
This produces the temp table:
id | rownumber | ...
---+-----------+----
 1 |         1 | ...
 1 |         2 | ...
 2 |         1 | ...
 2 |         2 | ...
Now the duplicates can be removed by deleting the rows whose rownumber is larger than 1:
DELETE FROM temp WHERE rownumber > 1
After that, rename the tables and you are done.
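The rename step might look like this (a sketch, not part of the original answer; it assumes the source table is named events and the deduplicated copy is named temp, as above, and that the old table is dropped only after the swap has been verified):

```sql
begin;
-- the rownumber helper column is no longer needed in the final table
alter table temp drop column rownumber;
alter table events rename to events_old;
alter table temp rename to events;
commit;
-- once the new events table checks out:
drop table events_old;
```

Doing the two renames inside one transaction keeps the window in which neither name exists as small as possible.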
Answer 6 (score: 0)
This method will preserve the permissions and the table definition of the original_table.

Create a table with the unique rows:
CREATE TABLE unique_table as
(
SELECT DISTINCT * FROM original_table
)
;
Back up original_table:
CREATE TABLE backup_table as
(
SELECT * FROM original_table
)
;
Truncate original_table:
TRUNCATE original_table
Insert the records from unique_table into original_table:
INSERT INTO original_table
(
SELECT * FROM unique_table
)
;