Question

Vertica allows duplicates to be inserted into a table. I can see them with the 'analyze_constraints' function. How do I delete the duplicate rows from a Vertica table?
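
For reference, a minimal sketch of the check mentioned above. It assumes the table is called mytable (a placeholder name) and that a primary key or unique constraint is declared on it:

```sql
-- Report the rows that violate the constraints declared on mytable
select analyze_constraints('mytable');
```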

Answer 1

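You should try to avoid, or at least limit, running DELETE against a large number of records. The following approach should be more efficient: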

Step 1: Create a new table with the same structure/projections as the table containing the duplicates:

create table mytable_new like mytable including projections ;

Step 2: Insert the de-duplicated rows into this new table:

insert /*+ direct */ into mytable_new select <column list> from (
    select * , row_number() over ( partition by <pk column list> ) as rownum from <table-name>
) a where a.rownum = 1 ;

Step 3: Rename the original table (the one containing the duplicates):

alter table mytable rename to mytable_orig ;

Step 4: Rename the new table:

alter table mytable_new rename to mytable ;

That's all.

Answer 2

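Off the top of my head, and not a great answer, so treat it as a last resort: you can delete all the duplicates and then insert a single copy back.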

Answer 3

Mauro's answer is correct, but there is an error in the SQL of step 2. The complete working procedure that avoids DELETE should be as follows:

Step 1: Create a new table with the same structure/projections as the table containing the duplicates:

create table mytable_new like mytable including projections ;

Step 2: Insert the de-duplicated rows into this new table:

insert /*+ direct */ into mytable_new select <column list> from (
    select * , row_number() over ( partition by <pk column list> ) as rownum from mytable
) a where a.rownum = 1 ;

Step 3: Rename the original table (the one containing the duplicates):

alter table mytable rename to mytable_orig ;

Step 4: Rename the new table:

alter table mytable_new rename to mytable ;
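
Before dropping the old copy, it may be worth checking the result of the swap. A minimal sketch, assuming the same placeholder names used above (mytable, mytable_orig, <pk column list>); the final drop is optional and only makes sense once the checks look right:

```sql
-- Should return no rows if the de-duplication worked
select <pk column list>, count(*)
from mytable
group by <pk column list>
having count(*) > 1 ;

-- Compare the row counts of the new and the original table
select count(*) from mytable ;
select count(*) from mytable_orig ;

-- Once satisfied, remove the original table
drop table mytable_orig cascade ;
```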

Answer 4

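You can remove duplicates from a Vertica table by creating a temporary table and generating a pseudo row_id. There are several steps, especially if you are removing duplicates from a very large and wide table. In the example below, I assume that the rows with keys k1 and k2 each have more than one duplicate. For details, see here.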

-- Step 1: Find the duplicates
select key, count(1) from large-table-1
where [where-conditions]
group by 1
having count(1) > 1
order by count(1) desc  ;

-- Step 2:  Dump the duplicates into temp table
create table test.large-table-1-dups
like large-table-1;

alter table test.large-table-1-dups     -- add row_num column (pseudo row_id)
add column row_num int;

insert into test.large-table-1-dups
select *, ROW_NUMBER() OVER(PARTITION BY key)
from large-table-1
where key in ('k1', 'k2');     -- where, say, k1 has n and k2 has m exact dups

-- Step 3: Remove duplicates from the temp table
delete from test.large-table-1-dups
where row_num > 1;

select * from test.large-table-1-dups;
--  Sanity test.  Should have 1 row each of k1 & k2 rows above

-- Step 4: Delete all duplicates from main table...
delete from large-table-1
where key in ('k1', 'k2');

-- Step 5: Insert data back into main table from temp dedupe data
alter table test.large-table-1-dups
drop column row_num;

insert into large-table-1
select * from test.large-table-1-dups;

Answer 5

You should take a look at the answer from the PostgreSQL wiki, which also works for Vertica:

DELETE
FROM
    tablename
WHERE
    id IN(
        SELECT
            id
        FROM
            (
                SELECT
                    id,
                    ROW_NUMBER() OVER(
                        partition BY column1,
                        column2,
                        column3
                    ORDER BY
                        id
                    ) AS rnum
                FROM
                    tablename
            ) t
        WHERE
            t.rnum > 1
    );

This deletes all duplicate entries except the one with the lowest id.
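
To preview how many rows would be removed before running the DELETE, the inner query can be run on its own. This is only a sketch using the same placeholder names (tablename, id, column1 to column3); like the DELETE above, it assumes id is unique per row:

```sql
-- Count the rows that the DELETE above would remove
SELECT COUNT(*)
FROM (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY column1, column2, column3
               ORDER BY id
           ) AS rnum
    FROM tablename
) t
WHERE t.rnum > 1;
```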

Deleting duplicate rows in a Vertica database
