Question

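As in this related question, I have a large table that stores all events in the system, and for one event type it contains duplicate rows (exported several times from another system by mistake). I need to delete them to clean up the statistics. The solution proposed there was to:

insert the records (without duplicates) into a temporary table
truncate the original table and insert them back.

But in my case I need to remove the duplicates of only one class of events, not all rows, so truncate is not possible. I am wondering whether I can benefit from the Postgres USING syntax, as in this SO answer, which offers the following solution: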

DELETE FROM user_accounts 
USING user_accounts ua2   
WHERE user_accounts.email = ua2.email AND user_accounts.id < ua2.id;

The problem is that this big table has no id field. So what is the fastest approach in this case? Is DELETE plus INSERT from a temporary table the only option?

Answer 1

You can use the ctid column as a "replacement id":

DELETE FROM user_accounts 
USING user_accounts ua2   
WHERE user_accounts.email = ua2.email 
  AND user_accounts.ctid < ua2.ctid;

Although this raises another question: why does your user_accounts table not have a primary key?
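Applied to the events table from the question, the ctid comparison can be restricted to the one affected event type. The following is only a sketch: the table name events, the columns event_type, occurred_at and payload, and the value 'exported' are assumptions for illustration, and the list of compared columns has to match whatever defines a duplicate in the real table.

```sql
-- Hypothetical table and column names; adjust to the real schema.
-- For each group of identical rows of one event type, this keeps
-- the row with the highest ctid and deletes the others.
DELETE FROM events e
USING events e2
WHERE e.event_type = 'exported'          -- only touch the affected event type
  AND e2.event_type = e.event_type
  AND e2.occurred_at = e.occurred_at     -- columns that define "duplicate"
  AND e2.payload = e.payload
  AND e.ctid < e2.ctid;                  -- ctid stands in for the missing id
```

Note that = never matches NULL values, so if any compared column can be NULL, IS NOT DISTINCT FROM may be needed instead. The ctid of a row can change when the row is updated or the table is rewritten, so it should only be relied on within a single statement like this one.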

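But if you delete a substantial part of the rows in the table, delete will never be very efficient (and the comparison on ctid is not fast either, because it is not indexed). So the delete will most probably take a very long time.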

For a one-time operation where many rows have to be deleted, inserting the rows you want to keep into an intermediate table is much faster.
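For the case in the question, where only one event type is affected, the intermediate table only needs to hold the deduplicated rows of that type. A minimal sketch, assuming a hypothetical table events with a column event_type and the value 'exported' for the duplicated type:

```sql
-- Hypothetical names (events, event_type, 'exported'); adjust to the real schema.
BEGIN;

-- 1. Copy one instance of each duplicated row into a temporary table.
CREATE TEMP TABLE events_keep AS
SELECT DISTINCT *
FROM events
WHERE event_type = 'exported';

-- 2. Remove every row of the affected type from the original table.
DELETE FROM events
WHERE event_type = 'exported';

-- 3. Put the deduplicated rows back.
INSERT INTO events
SELECT * FROM events_keep;

DROP TABLE events_keep;

COMMIT;
```

SELECT DISTINCT * compares every column, so it fails if the table contains a column type without an equality operator (json, for example); in that case the distinct columns would have to be listed explicitly. Whether this beats a direct DELETE depends on how large a share of the table the affected event type makes up.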

That approach can be improved by simply keeping the intermediate table instead of copying the rows back to the original table.

-- this will create the same table including indexes and not null constraint
-- but NOT foreign key constraints!
create table temp (like user_accounts including all);

insert into temp 
select distinct ... -- this is your query that removes the duplicates
from user_accounts;

 -- you might need cascade if the table is referenced by others
drop table user_accounts;

alter table temp rename to user_accounts;

commit;

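The only drawback is that you have to re-create the foreign keys for the original table (both the foreign keys that reference the original table and the ones that point from the original table to other tables).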
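To have the foreign key definitions at hand afterwards, they could be listed from the system catalog before the original table is dropped. This is a sketch using the pg_constraint catalog and pg_get_constraintdef(); the constraint and table names in the commented ALTER TABLE are hypothetical.

```sql
-- Run this BEFORE dropping user_accounts.
-- Lists foreign keys defined on user_accounts and foreign keys
-- in other tables that reference user_accounts.
SELECT conrelid::regclass        AS table_name,
       conname                   AS constraint_name,
       pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE contype = 'f'
  AND (conrelid  = 'user_accounts'::regclass
    OR confrelid = 'user_accounts'::regclass);

-- After the rename, each constraint can be re-created from its definition.
-- Hypothetical example (table and constraint names are made up):
-- ALTER TABLE user_sessions
--     ADD CONSTRAINT user_sessions_account_fkey
--     FOREIGN KEY (account_id) REFERENCES user_accounts (id);
```

On a large table, adding a foreign key validates all existing rows, so this step can take a while as well.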

Removing duplicate rows from a large (> 100 million rows) PostgreSQL table (truncate with a condition?)
