Question

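I ran into this question in one of my interviews.

The table "TableA" contains 10 million records, of which nearly 1,000 are duplicates. How can we delete these duplicates in the most efficient way?

Can anyone help with the most efficient solution?

The solution I proposed was:

Create a temporary table with data:

create table tmp as (select distinct * from TableA) with data

Drop the original table.

Reload the data from tmp into TableA.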
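The three steps proposed in the question can be tried end to end on a small data set. The following is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for Teradata, and the table name `table_a`, its columns, and the sample rows are all hypothetical. It empties the original table instead of dropping it so the table definition survives.

```python
import sqlite3

# SQLite stands in for Teradata here; table_a and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?)",
    [(1, "a"), (2, "b"), (2, "b"), (3, "c"), (3, "c")],
)

# Step 1: copy only the distinct rows into a temporary table.
conn.execute("CREATE TEMP TABLE tmp AS SELECT DISTINCT * FROM table_a")
# Step 2: empty the original table (kept rather than dropped, so its definition remains).
conn.execute("DELETE FROM table_a")
# Step 3: reload the de-duplicated rows and discard the temporary copy.
conn.execute("INSERT INTO table_a SELECT * FROM tmp")
conn.execute("DROP TABLE tmp")
conn.commit()

rows = conn.execute("SELECT id, name FROM table_a ORDER BY id").fetchall()
print(rows)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

Note that this approach rewrites every row of the table, not just the duplicates, which is the cost the answers below try to weigh.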

Answer 1

I don't have a Teradata system to try this on, but you could use something like this:

delete 
from table 
where table.rowid not in 
(
select max(table.rowid) 
from table 
group by col1,col2,col3.....
)
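To see what the statement above does, here is a small runnable sketch of the same idea. It is an illustration under assumptions, not Teradata code: it runs on SQLite (through Python's sqlite3 module), which exposes an implicit `rowid` on ordinary tables, and the table `t` with columns `col1`, `col2`, `col3` is hypothetical. Whether your Teradata release lets you reference a row identifier this way is something to check before relying on it.

```python
import sqlite3

# SQLite is used only because it has an implicit rowid; t and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 INTEGER, col3 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, "x"), (1, 1, "x"), (1, 2, "y"), (1, 3, "z"), (1, 3, "z"), (1, 3, "z")],
)

# Keep the row with the highest rowid in each group of identical rows,
# and delete every other row in that group.
cur = conn.execute(
    """
    DELETE FROM t
    WHERE rowid NOT IN (
        SELECT MAX(rowid) FROM t GROUP BY col1, col2, col3
    )
    """
)
deleted = cur.rowcount
conn.commit()

remaining = conn.execute(
    "SELECT col1, col2, col3 FROM t ORDER BY col1, col2"
).fetchall()
print(deleted)    # 3
print(remaining)  # [(1, 1, 'x'), (1, 2, 'y'), (1, 3, 'z')]
```

Only the surplus copies are deleted, so the rows that were never duplicated are left untouched.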

Answer 2

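Select the primary column(s) of the table in the select statement, for example EmailID, Mobile, or any other value that should be a unique key, and try this query. Note that it only lists the duplicated rows; it does not delete them.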

select t1.column1, t1.column2 from table1 t1 where (t1.column1, t1.column2) in (select t2.column1, t2.column2 from table1 t2 group by t2.column1, t2.column2 having count(*) > 1);
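The grouping step at the heart of that query can be checked on its own. This is a rough sketch under assumptions: SQLite (via Python's sqlite3 module) stands in for Teradata, and `table1`, `column1`, `column2`, and the sample values are hypothetical. It reports each duplicated key together with the number of copies, which is a cheap way to confirm how many duplicates exist before deleting anything.

```python
import sqlite3

# SQLite stands in for Teradata; table1 and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column1 TEXT, column2 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        ("a@x.com", "111"),
        ("a@x.com", "111"),
        ("b@x.com", "222"),
        ("c@x.com", "333"),
        ("c@x.com", "333"),
    ],
)

# Group by the columns that should be unique and keep only groups with more than one row.
dupes = conn.execute(
    """
    SELECT column1, column2, COUNT(*) AS copies
    FROM table1
    GROUP BY column1, column2
    HAVING COUNT(*) > 1
    ORDER BY column1
    """
).fetchall()
print(dupes)  # [('a@x.com', '111', 2), ('c@x.com', '333', 2)]
```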

Answer 3

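Since the table is assumed to be very large, using a volatile table to hold the de-duplicated records is probably the most sensible approach. It may also be the most efficient one, although I have not tested that.

Something like the following makes sense: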

/*A hypothetical very large table*/
CREATE MULTISET VOLATILE TABLE testtable
(
    f1 integer,
    f2 integer,
    f3 DATE
) 
    PRIMARY INDEX (f1, f2) 
    ON COMMIT PRESERVE ROWS;

INSERT INTO testtable VALUES (1,1, DATE '2017-01-01');
INSERT INTO testtable VALUES (1,2, DATE '2017-01-01');
INSERT INTO testtable VALUES (1,1, DATE '2017-02-01');
INSERT INTO testtable VALUES (1,3, DATE '2017-01-01');
INSERT INTO testtable VALUES (1,3, DATE '2017-01-03');

/*assuming a key of f1 and f2 to identify a duplicate
 *and assuming that if we encounter a duplicate we want
 *to keep the newest one by the f3 date, then:
 *generate a volatile table to hold deduped records
 *using a QUALIFY clause to perform duplicate identification
 */
CREATE MULTISET VOLATILE TABLE testtable_dedup AS
(
    SELECT * FROM testtable
    QUALIFY ROW_NUMBER() OVER (PARTITION BY f1, f2 /*key*/ ORDER BY f3 desc /*date for each key sorted descending*/) = 1 /*keep the newest record*/
) WITH DATA 
 PRIMARY INDEX (f1, f2)
 ON COMMIT PRESERVE ROWS;

/*show what records are being dropped*/
SELECT * FROM testtable
MINUS
SELECT * FROM testtable_dedup;

/*Delete everything*/
DELETE FROM testtable ALL;

/*And reload from the dedup volatile table*/
INSERT INTO testtable SELECT * FROM testtable_dedup;

SELECT * FROM testtable;

/*Clean up*/
DROP TABLE testtable_dedup;
DROP TABLE testtable;

Delete 1000 duplicates from a table containing 10 million records in Teradata
