Deleting duplicate rows in a Vertica database

Time: 2013-06-19 20:00:57

Tags: vertica

Vertica allows duplicates to be inserted into a table. I can see them using the 'analyze_constraints' function. How do I delete the duplicate rows from a Vertica table?

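5 answers:

Answer 0: (score: 5)

You should try to avoid, or at least limit, running DELETE against a large number of records. The following approach should be more efficient:

Step 1: Create a new table with the same structure/projections as the one containing the duplicates: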

create table mytable_new like mytable including projections ;

Step 2: Insert the deduplicated rows into this new table:

insert /* +direct */ into mytable_new select <column list> from (
    select * , row_number() over ( partition by <pk column list> ) as rownum from <table-name>
) a where a.rownum = 1 ;

Step 3: Rename the original table (the one containing the duplicates):

alter table mytable rename to mytable_orig ;

Step 4: Rename the new table:

alter table mytable_new rename to mytable ;

That's all.
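
The swap above can be checked before the original table is discarded. The following is only a sketch: it assumes the tables are named mytable and mytable_orig as in the steps above, and pk_col is a hypothetical stand-in for the <pk column list> placeholder.

```sql
-- Row counts before and after; the difference is the number of duplicates removed.
select count(*) as rows_before from mytable_orig;
select count(*) as rows_after  from mytable;

-- Should return no rows if the deduplication worked.
-- pk_col is a hypothetical key column; replace it with the real key column(s).
select pk_col, count(*)
from mytable
group by pk_col
having count(*) > 1;

-- Only once the checks look right: drop the copy that still holds the duplicates.
drop table mytable_orig;
```

Keeping mytable_orig around until the checks pass means the rename can still be reversed if something looks wrong.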

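Answer 1: (score: 2)

Off the top of my head, and not a great answer, so treat it as a last resort: you could delete the duplicates and insert a single copy back.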

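Answer 2: (score: 2)

Mauro's answer is correct, but there is an error in the SQL of step 2. The complete working way to avoid DELETE should therefore be as follows:

Step 1: Create a new table with the same structure/projections as the one containing the duplicates: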

create table mytable_new like mytable including projections ;

Step 2: Insert the deduplicated rows into this new table:

insert /* +direct */ into mytable_new select <column list> from (
    select * , row_number() over ( partition by <pk column list> ) as rownum from mytable
) a where a.rownum = 1 ;

Step 3: Rename the original table (the one containing the duplicates):

alter table mytable rename to mytable_orig ;

Step 4: Rename the new table:

alter table mytable_new rename to mytable ;
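
One thing the step 2 statement leaves open is which of the duplicate rows survives: with no ORDER BY inside the window, the row that gets number 1 is arbitrary. If that matters, an ORDER BY can be added to the window. The sketch below illustrates this on a small throwaway table; dedup_demo, dedup_demo_new and the columns id, name and updated_at are all hypothetical names chosen for illustration, not part of the original answer.

```sql
-- Hypothetical demo table with one duplicated key (id = 1).
create table dedup_demo (id int, name varchar(20), updated_at timestamp);
insert into dedup_demo values (1, 'old',  '2013-01-01 00:00:00');
insert into dedup_demo values (1, 'new',  '2013-06-01 00:00:00');
insert into dedup_demo values (2, 'only', '2013-03-01 00:00:00');
commit;

-- Same structure and projections, no data.
create table dedup_demo_new like dedup_demo including projections;

-- Keep exactly one row per id: the one with the latest updated_at.
-- The explicit column list keeps the helper column rownum out of the new table.
insert /*+ direct */ into dedup_demo_new
select id, name, updated_at from (
    select *, row_number() over (partition by id order by updated_at desc) as rownum
    from dedup_demo
) a
where a.rownum = 1;
commit;

-- Swap the tables, as in steps 3 and 4.
alter table dedup_demo rename to dedup_demo_orig;
alter table dedup_demo_new rename to dedup_demo;
```

With this ordering, the row kept for id = 1 should be the 'new' one. If the duplicates are exact copies in every column, the ordering makes no difference and the original form without ORDER BY is enough.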

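Answer 3: (score: 1)

You can remove duplicates from a Vertica table by creating a temporary table and generating a pseudo row_id. There are a few steps, especially if you are removing duplicates from a very large and wide table. In the example below, I assume the rows with keys k1 and k2 each have more than one duplicate. For details, see here.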

-- Step 1: Find the duplicates
select key, count(1) from large_table_1
-- where <where-conditions>    (optional filter)
group by 1
having count(1) > 1
order by count(1) desc;

-- Step 2: Dump the duplicates into a temp table
create table test.large_table_1_dups
like large_table_1;

alter table test.large_table_1_dups     -- add row_num column (pseudo row_id)
add column row_num int;

insert into test.large_table_1_dups
select *, ROW_NUMBER() OVER(PARTITION BY key)
from large_table_1
where key in ('k1', 'k2');     -- where, say, k1 has n and k2 has m exact dups

-- Step 3: Remove duplicates from the temp table
delete from test.large_table_1_dups
where row_num > 1;

select * from test.large_table_1_dups;
-- Sanity test. Should have exactly 1 row each for k1 and k2

-- Step 4: Delete all copies of the duplicated keys from the main table
delete from large_table_1
where key in ('k1', 'k2');

-- Step 5: Insert the deduplicated rows back into the main table
alter table test.large_table_1_dups
drop column row_num;

insert into large_table_1
select * from test.large_table_1_dups;

Answer 4: (score: -2)

You should take a look at the answer from the PostgreSQL wiki, which also works for Vertica:

DELETE
FROM
    tablename
WHERE
    id IN(
        SELECT
            id
        FROM
            (
                SELECT
                    id,
                    ROW_NUMBER() OVER(
                        PARTITION BY column1, column2, column3
                        ORDER BY id
                    ) AS rnum
                FROM
                    tablename
            ) t
        WHERE
            t.rnum > 1
    );

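This deletes all duplicate entries except the one with the lowest id. Note that it relies on id being unique for every row; if the duplicated rows share the same id, the IN clause matches all of them and no copy is kept.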
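
Before running a DELETE like the one above, it can help to preview how many rows it would remove. A minimal sketch, reusing the same placeholder names (tablename, id, column1, column2, column3) as the statement above:

```sql
-- Counts the rows the DELETE would remove: every row except the
-- lowest-id row within each (column1, column2, column3) group.
select count(*) as rows_to_delete
from (
    select id,
           row_number() over (partition by column1, column2, column3 order by id) as rnum
    from tablename
) t
where t.rnum > 1;
```

If the count is a large share of the table, the copy-and-rename approach from the earlier answers is likely the better fit.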