Question

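I have a large (more than a million rows) MySQL database that is cluttered with duplicates. I think they could make up anywhere from 1/4 to 1/2 of the whole database. I need to get rid of them quickly (by which I mean query execution time). This is what it looks like:
id (indexed) | text1 | text2 | text3
The text1 & text2 combination should be unique. If there are any duplicates, only one row should remain, preferably one where text3 is NOT NULL. Example: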

1 | abc | def | NULL  
2 | abc | def | ghi  
3 | abc | def | jkl  
4 | aaa | bbb | NULL  
5 | aaa | bbb | NULL

...becomes:

1 | abc | def | ghi   #(doesn't really matter whether id:2 or id:3 survives)
2 | aaa | bbb | NULL  #(if there's no NOT NULL text3, NULL will do)

The new ids can be anything; they do not depend on the old table's ids. I have tried things like:

CREATE TABLE tmp SELECT text1, text2, text3
FROM my_tbl
GROUP BY text1, text2;
DROP TABLE my_tbl;
ALTER TABLE tmp RENAME TO my_tbl;

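or SELECT DISTINCT and other variations.
While they work on small databases, the query execution time on mine is just huge (it never actually finished; > 20 min).

Is there any faster way to do that? Please help me solve this problem.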

Answer 1

I believe this will do it, using ON DUPLICATE KEY UPDATE + IFNULL():

create table tmp like yourtable;

alter table tmp add unique (text1, text2);

insert into tmp select * from yourtable 
    on duplicate key update text3=ifnull(text3, values(text3));

rename table yourtable to deleteme, tmp to yourtable;

drop table deleteme;

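This should be much faster than anything that requires GROUP BY, DISTINCT, a subquery, or even ORDER BY. It doesn't even require a filesort, which kills performance on a large temporary table. It still requires a full scan of the original table, but there is no avoiding that.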
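To see how the IFNULL() upsert resolves the question's sample rows, here is a small sketch. The table names dedup_demo and dedup_demo_tmp and the VARCHAR column types are assumptions made for illustration. The target column is qualified with the table name only to rule out any ambiguity between the source and target text3 columns.

```sql
-- Hypothetical demo tables; names and column types are illustrative assumptions.
CREATE TABLE dedup_demo (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    text1 VARCHAR(32),
    text2 VARCHAR(32),
    text3 VARCHAR(32)
);

INSERT INTO dedup_demo (text1, text2, text3) VALUES
    ('abc', 'def', NULL),
    ('abc', 'def', 'ghi'),
    ('abc', 'def', 'jkl'),
    ('aaa', 'bbb', NULL),
    ('aaa', 'bbb', NULL);

CREATE TABLE dedup_demo_tmp LIKE dedup_demo;
ALTER TABLE dedup_demo_tmp ADD UNIQUE (text1, text2);

-- The first row seen for each (text1, text2) pair is inserted. Later rows hit
-- the unique key and only fill text3 while the stored value is still NULL.
INSERT INTO dedup_demo_tmp
SELECT * FROM dedup_demo
ON DUPLICATE KEY UPDATE
    dedup_demo_tmp.text3 = IFNULL(dedup_demo_tmp.text3, VALUES(text3));

-- Expect two rows: ('abc', 'def') with a non-NULL text3, ('aaa', 'bbb') with NULL.
SELECT text1, text2, text3 FROM dedup_demo_tmp;
```

Which of 'ghi' or 'jkl' survives depends on the order in which the source rows are read, which matches the question's statement that either is acceptable.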

Answer 2

Found this simple one-line statement that does exactly what I needed:

ALTER IGNORE TABLE dupTest ADD UNIQUE INDEX(a,b);

Taken from: http://mediakey.dk/~cc/mysql-remove-duplicate-entries/

Answer 3

DELETE FROM dups
WHERE id NOT IN(
    SELECT id FROM (
        SELECT DISTINCT id, text1, text2
            FROM dups
        GROUP BY text1, text2
        ORDER BY text3 DESC
    ) as tmp
)

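This queries all records, groups them by the distinguishing fields, and orders them by text3 (the intent being that the first record with a non-NULL text3 is picked). Then we select the ids from that result (these are the good ids; they won't be deleted) and delete every id that is NOT among them.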
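One caveat: in MySQL the ORDER BY of a grouped query sorts the groups after they have been formed, so it does not decide which row's id represents each group, and selecting a non-aggregated id is rejected when the ONLY_FULL_GROUP_BY SQL mode is enabled. A minimal sketch of a deterministic variant, assuming the same dups table, keeps the lowest id that has a non-NULL text3 in each group and falls back to the lowest id overall. The alias keep_id is introduced here for illustration only.

```sql
DELETE FROM dups
WHERE id NOT IN (
    SELECT keep_id FROM (
        SELECT COALESCE(
                   MIN(CASE WHEN text3 IS NOT NULL THEN id END),  -- prefer a row that has text3
                   MIN(id)                                        -- otherwise any row of the group
               ) AS keep_id
        FROM dups
        GROUP BY text1, text2
    ) AS tmp  -- wrapping in a derived table lets the subquery read the table being deleted from
);
```

Because id is assumed to be a non-NULL key, keep_id is never NULL, so the NOT IN comparison behaves as expected.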

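Any query like this that touches the entire table will be slow. You just need to run it once and let it finish, so that you can prevent the problem in the future.

After this "fix" is done, I would apply a UNIQUE INDEX (text1, text2) to the table, to prevent the possibility of duplicates in the future.

If you want to go the "create a new table and replace the old one" route, you can use the innermost select statement to build your insert statement.

MySQL specific (assumes the new table is named my_tbl2 and has exactly the same structure):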

INSERT INTO my_tbl2
    SELECT DISTINCT id, text1, text2, text3
            FROM dups
        GROUP BY text1, text2
        ORDER BY text3 DESC

See MySQL INSERT ... SELECT for more information.

Answer 4

Remove duplicates without removing foreign keys:

create table tmp like mytable;
ALTER TABLE tmp ADD UNIQUE INDEX(text1, text2, text3, text4, text5, text6);
insert IGNORE into tmp select * from mytable;
delete from mytable where id not in ( select id from tmp);

Answer 5

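If you can create a new table, do so with a unique key on the text1 + text2 fields. Then insert into that table, ignoring errors (using the INSERT IGNORE syntax), from: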

select * from my_tbl order by text3 desc

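I think ordering by text3 desc will put the NULLs last, but double-check that.

Indexes on all of those columns could help a lot, but creating them now could be pretty slow.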
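The steps described above can be sketched as follows. The table name my_tbl_dedup and the key name uq_text1_text2 are hypothetical, and the sketch assumes text1 and text2 are of a type that can be indexed without a prefix length (for example VARCHAR).

```sql
-- my_tbl_dedup is a hypothetical name for the new, deduplicated table.
CREATE TABLE my_tbl_dedup LIKE my_tbl;
ALTER TABLE my_tbl_dedup ADD UNIQUE KEY uq_text1_text2 (text1, text2);

-- Rows arrive with non-NULL text3 first, so the first row accepted for each
-- (text1, text2) pair has a text3 value whenever one exists. INSERT IGNORE
-- then skips the later duplicates instead of raising an error.
INSERT IGNORE INTO my_tbl_dedup
SELECT * FROM my_tbl
ORDER BY text3 DESC;
```

Unlike the approaches that avoid sorting, this one does sort the whole source table, which is the price of controlling which duplicate is kept.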

Answer 6

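For large tables with few duplicates, you may want to avoid copying the whole table to another place. One way is to create a temporary table holding the rows you want to keep (one for each key that has duplicates), and then delete the duplicates from the original table.

An example here.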
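A minimal sketch of that idea against the question's schema, assuming the table is called my_tbl, that id is a non-NULL key, and that text1 and text2 are never NULL (NULL keys would not match in the join). The temporary table name keepers and the column keep_id are hypothetical.

```sql
-- keepers is a hypothetical temporary table: one row per duplicated key,
-- holding the id of the row that should survive.
CREATE TEMPORARY TABLE keepers AS
SELECT text1,
       text2,
       COALESCE(MIN(CASE WHEN text3 IS NOT NULL THEN id END), MIN(id)) AS keep_id
FROM my_tbl
GROUP BY text1, text2
HAVING COUNT(*) > 1;

-- Remove every row of a duplicated key except the chosen one.
DELETE t
FROM my_tbl AS t
JOIN keepers AS k
  ON k.text1 = t.text1
 AND k.text2 = t.text2
WHERE t.id <> k.keep_id;

DROP TEMPORARY TABLE keepers;
```

Keys that occur only once never enter keepers, so those rows are not touched by the DELETE.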

Answer 7

I don't have much experience with MySQL. If it has analytic functions, try:

delete from my_tbl
 where id in (
     select id 
       from (select id, row_number()
                            over (partition by text1, text2 order by text3 desc) as rn
               from my_tbl
               /* optional: where text1 like 'a%'  */
             ) as t2
       where rn > 1
     )

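The optional where clause means you would have to run it multiple times, once per starting letter, and so on. Create an index on text1?

Before running this, confirm that "text3 desc" sorts NULLs last in MySQL.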
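For reference, window functions such as row_number() are available in MySQL from version 8.0. The NULL ordering can be checked with a throwaway query like the one below; the derived-table alias vals and the column v are made up for the check.

```sql
-- MySQL sorts NULL before all other values in ascending order,
-- so with DESC the NULL comes out last.
SELECT v
FROM (
    SELECT 'ghi' AS v
    UNION ALL SELECT NULL
    UNION ALL SELECT 'jkl'
) AS vals
ORDER BY v DESC;
-- rows come back as: 'jkl', 'ghi', NULL
```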

Answer 8

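I know this is an old thread, but I have a somewhat messy method that is much faster and customizable. In terms of speed I'd say 10 seconds instead of 100 seconds (10:1).

My method does require all the messy stuff you were trying to avoid:

Group by (and having)
Group concat with ORDER BY
2 temporary tables
Using files on disk!
Deleting, somehow (php?)

But when you are talking about millions of rows (or in my case tens of millions), it's worth it.

Anyway, it's not much, since the comments are in Portuguese, but here is my sample:

EDIT: if I get comments I'll explain further how it works :)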

START TRANSACTION;

DROP temporary table if exists to_delete;

CREATE temporary table to_delete as (
    SELECT
        -- escolhe todos os IDs duplicados menos os que ficam na BD
        -- A ordem de escolha dos IDs é dada por "ORDER BY campo_ordenacao DESC" em que o primeiro é o que fica
        right(
            group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ','),
            length(group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ',')) 
                - locate(",",group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ','))
        ) as ids,

        count(*) as c

    -- Tabela a eliminar duplicados
    FROM teste_dup

    -- campos a usar para identificar  duplicados
    group by test_campo1, test_campo2, teste_campoN
    having count(*) > 1 -- é duplicado
);

-- aumenta o limite desta variável de sistema para o máx 
SET SESSION group_concat_max_len=4294967295;

-- envia os ids todos a eliminar para um ficheiro
select group_concat(ids SEPARATOR ',') from to_delete INTO OUTFILE 'sql.dat';

DROP temporary table if exists del3;
create temporary table del3 as (select CAST(1 as signed) as ix LIMIT 0);

-- insere os ids a eliminar numa tabela temporaria a partir do ficheiro
load data infile 'sql.dat' INTO TABLE del3
LINES TERMINATED BY ',';

alter table del3 add index(ix);

-- elimina os ids seleccionados
DELETE teste_dup -- tabela 
from teste_dup -- tabela

join del3 on id=ix;

COMMIT;
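The least obvious step in the script is the RIGHT(...) expression, which according to its comments picks every duplicate id except the one that stays in the database. A small sketch with a literal string standing in for the GROUP_CONCAT result of one duplicate group (the value '7,3,1' is invented for the example):

```sql
-- '7,3,1' stands in for GROUP_CONCAT(id ORDER BY ... DESC) of one group;
-- the first id (7) is the row to keep.
SELECT RIGHT('7,3,1', LENGTH('7,3,1') - LOCATE(',', '7,3,1')) AS ids;
-- LOCATE returns 2 and LENGTH returns 5, so RIGHT keeps the last 3 characters: '3,1'

-- An equivalent, shorter spelling of the same idea:
SELECT SUBSTRING('7,3,1', LOCATE(',', '7,3,1') + 1) AS ids;
```

Note that the script raises group_concat_max_len only after to_delete has been built, so the per-group id lists are still subject to the session's previous limit (1024 bytes by default). Moving the SET statement before the CREATE would avoid truncation for groups with very many duplicates.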

Answer 9

You can remove all duplicate entries with this simple query. It selects all the duplicate records and deletes them.

 DELETE i1 
FROM TABLE i1
LEFT JOIN TABLE i2
  ON i1.id = i2.id
 AND i1.colo = i2.customer_invoice_id
 AND i1.id < i2.id
WHERE i2.customer_invoice_id IS NOT NULL
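As written, the join requires both i1.id = i2.id and i1.id < i2.id, which can never hold at the same time, so no row would match, and TABLE is a placeholder for the real table name. A hedged sketch of what the same self-join idea might look like for the question's schema, assuming the table is my_tbl and that text1 and text2 are never NULL:

```sql
-- Delete i1 whenever a "better" row i2 exists for the same (text1, text2):
-- either i2 has text3 and i1 does not, or both are alike and i2 has the higher id.
DELETE i1
FROM my_tbl AS i1
JOIN my_tbl AS i2
  ON i1.text1 = i2.text1
 AND i1.text2 = i2.text2
 AND (
        (i1.text3 IS NULL AND i2.text3 IS NOT NULL)
     OR ((i1.text3 IS NULL) = (i2.text3 IS NULL) AND i1.id < i2.id)
     );
```

On the question's sample data this would keep ids 3 and 5. A self-join like this needs an index on (text1, text2) to be usable on a table with millions of rows.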

MySQL: quickly removing duplicates from a large database
