Question

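I know this topic has come up many times before, but none of the suggested solutions worked for my dataset: my laptop stops the computation, either because it runs out of memory or because the disk fills up.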

My table looks like this and has 108 million rows:

Col1        | Col2 | Col3            | Col4 | SICComb  | NameComb
Case New    | 3523 | Alexander       | 6799 | 67993523 | AlexanderCase New
Case New    | 3523 | Undisclosed     | 6799 | 67993523 | Case NewUndisclosed
Undisclosed | 6799 | Case New        | 3523 | 67993523 | Case NewUndisclosed
Case New    | 3523 | Undisclosed     | 6799 | 67993523 | Case NewUndisclosed
SmartCard   | 3674 | NEC             | 7373 | 73733674 | NECSmartCard
SmartCard   | 3674 | Virtual NetComm | 7373 | 73733674 | SmartCardVirtual NetComm
SmartCard   | 3674 | NEC             | 7373 | 73733674 | NECSmartCard

The columns that identify a unique row are SICComb and NameComb. I tried to add a primary key:

ALTER TABLE dbo.test ADD ID INT IDENTITY(1,1)

But the integers fill up 30 GB of my storage within just a few minutes.

What is the fastest and most efficient way to delete the duplicates from the table?

Answer 1

If you are using SQL Server, you can delete through a common table expression:

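-- Number the rows inside each (SICComb, NameComb) group.
with cte as (
    select row_number() over(partition by SICComb, NameComb order by Col1) as row_num
    from Table1
)
-- Row 1 of each group is kept; every later row is a duplicate.
delete
from cte
where row_num > 1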

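Here all rows are numbered, and each unique combination of SICComb + NameComb gets its own sequence. You can choose which rows get deleted through the order by inside the over clause.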
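As a small illustration of that choice, the sketch below runs the same kind of delete on a throwaway temp table holding a few rows from the question. The table name #demo, the column types, and the descending sort on Col3 are assumptions made for this example only. On 108 million rows the cost of the sort and of the transaction log would still have to be checked.

```sql
-- Hypothetical scratch table with a few rows taken from the question.
create table #demo (
    Col1     varchar(50),
    Col2     int,
    Col3     varchar(50),
    Col4     int,
    SICComb  varchar(20),
    NameComb varchar(100)
);

insert into #demo (Col1, Col2, Col3, Col4, SICComb, NameComb)
values
    ('Case New',    3523, 'Undisclosed', 6799, '67993523', 'Case NewUndisclosed'),
    ('Undisclosed', 6799, 'Case New',    3523, '67993523', 'Case NewUndisclosed'),
    ('SmartCard',   3674, 'NEC',         7373, '73733674', 'NECSmartCard'),
    ('SmartCard',   3674, 'NEC',         7373, '73733674', 'NECSmartCard');

-- Within each (SICComb, NameComb) group, the row numbered 1 survives.
-- Sorting by Col3 descending is an arbitrary choice for this example.
with cte as (
    select row_number() over (
               partition by SICComb, NameComb
               order by Col3 desc
           ) as row_num
    from #demo
)
delete from cte
where row_num > 1;

-- One row per (SICComb, NameComb) combination remains.
select * from #demo;

drop table #demo;
```

Changing the order by expression changes which row of each group is numbered 1, and therefore which row is kept.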

Answer 2

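In general, the fastest way to remove duplicates from a table is to insert the records (without duplicates) into a temporary table, truncate the original table, and insert them back.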

Here is the idea, using SQL Server syntax:

select distinct t.*
into #temptable
from t;

truncate table t;

insert into t
    select tt.*
    from #temptable tt;

Of course, this depends a lot on how fast the first step is. You also need enough space to store two copies of the same table.

Note that the syntax for creating temporary tables varies by database. Some use create table as rather than select into.
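For comparison, here is a rough sketch of the same three steps in a database that uses create table as. It is written with PostgreSQL in mind, which is an assumption, as is the temporary table name temptable. The table t is the same placeholder as above.

```sql
-- Step 1: copy the distinct rows into a temporary table
-- (PostgreSQL-style syntax; "temptable" is a hypothetical name).
create temporary table temptable as
select distinct t.*
from t;

-- Step 2: empty the original table.
truncate table t;

-- Step 3: put the deduplicated rows back.
insert into t
    select tt.*
    from temptable tt;
```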

Edit:

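Your identity insert error is a nuisance. I think you need to remove the identity column from the list of columns used for the distinct. Or do: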

select min(<identity col>), <all other columns>
from t
group by <all other columns>

If you have an identity column, then there are no duplicates (by definition).

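In the end, you need to decide which id you want for the rows. If you can generate new ids for the rows, then just leave the identity column out of the column list for the insert: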

insert into t(<all other columns>)
    select <all other columns>
    from #temptable;

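If you need the old identity values (and the minimum will do), turn identity insert on for the table (set identity_insert t on) so that explicit values are accepted, and do: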

insert into t(<all columns including identity>)
    select <all columns including identity>
    from #temptable;
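To make the placeholders concrete, here is a minimal end-to-end sketch of the variant that keeps the old identity values. It assumes SQL Server. The temp tables #t and #dedup and the reduced column set are hypothetical stand-ins for the real table.

```sql
-- Hypothetical table with an identity column and one duplicated row.
create table #t (
    ID       int identity(1,1),
    SICComb  varchar(20),
    NameComb varchar(100)
);

insert into #t (SICComb, NameComb)
values ('67993523', 'Case NewUndisclosed'),
       ('67993523', 'Case NewUndisclosed'),
       ('73733674', 'NECSmartCard');

-- Keep the smallest ID of every group of otherwise identical rows.
select min(ID) as ID, SICComb, NameComb
into #dedup
from #t
group by SICComb, NameComb;

truncate table #t;

-- Accept explicit values for the identity column while re-inserting.
set identity_insert #t on;

insert into #t (ID, SICComb, NameComb)
select ID, SICComb, NameComb
from #dedup;

set identity_insert #t off;

-- The surviving rows carry their original ID values.
select * from #t order by ID;

drop table #dedup;
drop table #t;
```

Note that identity insert can be on for only one table per session at a time, so it is switched off again right after the insert.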

Removing duplicates from a large dataset (> 100 million rows)
