Question

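I have a table named words that stores English words. Its columns are (id, word, type, explanation, example).

The words were scraped from vocabulary.com, and the table contains many duplicate rows. The words table has 800k+ rows, so I want to remove the duplicates. My plan was:

Step 1: create a table named "words_tmp" and insert all the duplicated words from the "words" table into it;

Step 2: delete from "words" every row whose "word" value appears in "words_tmp";

Step 3: merge the two tables.

Here are my statements. Step 1: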

INSERT INTO words_tmp 
   (SELECT * 
       FROM words 
       GROUP BY word 
       HAVING COUNT(word) > 1 );

Step 2:

DELETE FROM words 
   WHERE word in (SELECT word FROM words_tmp);

But Step 2 is extremely slow: I waited 10 minutes and nothing happened. Can anyone tell me how to optimize my SQL statements?

Answer 1

You can try deleting them with:

delete w
    from words w join
         (select w.word, min(w.id) as min_id
          from words w
          group by w.word
         ) ww
         on w.word = ww.word and w.id > ww.min_id;

This assumes id is unique. An index on words(word, id) will speed this up.

That said, the most efficient approach is usually to empty the table and repopulate it:
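As a rough sketch of the index suggestion, the statements below create the composite index and then count how many rows the DELETE would remove, so the result can be checked before running it. The index name idx_words_word_id is hypothetical, and the sketch assumes word is a VARCHAR column (a TEXT column would need a prefix length in the index definition).

```sql
-- Composite index so the GROUP BY word / MIN(id) lookup can use the index.
-- idx_words_word_id is a hypothetical name.
CREATE INDEX idx_words_word_id ON words (word, id);

-- Dry run: rows with a non-NULL word, minus one surviving row per distinct word.
SELECT COUNT(word) - COUNT(DISTINCT word) AS rows_to_delete
FROM words;
```

If rows_to_delete is 0, there are no duplicate words and the DELETE has nothing to do.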

create table temp_words as
    select w.*
    from words w
    -- keep only the lowest-id row for each word
    where w.id = (select min(w2.id) from words w2 where w2.word = w.word);

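Then truncate the original table and re-insert the rows (make sure you have a copy of the original table stashed somewhere before you try this!):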

truncate table words;

insert into words
    select w.*
    from temp_words w;

Compared with delete, inserts involve much less locking and logging, so this usually speeds up this kind of process.
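One possible way to stash the copy mentioned above is sketched here. The table name words_backup is hypothetical. In MySQL, TRUNCATE TABLE causes an implicit commit and cannot be rolled back, so the copy should exist and be verified before truncating.

```sql
-- words_backup is a hypothetical name for the safety copy.
CREATE TABLE words_backup LIKE words;          -- same columns and indexes as words
INSERT INTO words_backup SELECT * FROM words;  -- full copy of the data

-- Sanity check before truncating: both counts should match.
SELECT (SELECT COUNT(*) FROM words)        AS original_rows,
       (SELECT COUNT(*) FROM words_backup) AS backup_rows;
```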

Answer 2

If you have enough time to experiment, try

insert into words_tmp (select distinct * from words)

Now words_tmp contains only distinct rows.
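Note that select distinct * compares every column, id included, so it only collapses rows that are identical in all columns. If id is unique, a variant along the following lines may be closer to what the question wants: it keeps one row per word, the one with the lowest id. This is a sketch that assumes words_tmp is empty and has the same column layout as words; keep_rows and min_id are illustrative names.

```sql
INSERT INTO words_tmp
SELECT w.*
FROM words w
JOIN (
    SELECT MIN(id) AS min_id   -- lowest id per word is the row we keep
    FROM words
    GROUP BY word
) keep_rows ON w.id = keep_rows.min_id;
```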

Answer 3

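I can think of two approaches:

If there are only a few duplicates (say, no more than 5% of the rows), go through the rows one by one and check whether the table contains another match.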

delete from words
where exists
(
  select *
  from words other
  where other.word = words.word
  and other.id < words.id
);

You should have an index on word and id for this:

create index idx_words on words(word, id);
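MySQL generally rejects a DELETE whose subquery reads from the table being deleted from (error 1093, "You can't specify target table ... for update in FROM clause"). A minimal sketch of the same idea as a MySQL multi-table DELETE with a self-join, assuming id is unique:

```sql
-- Delete every row for which an earlier row (lower id) has the same word.
DELETE w
FROM words w
JOIN words other
  ON other.word = w.word
 AND other.id < w.id;
```

The row with the lowest id for each word has no matching "other" row, so it survives.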

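If, on the other hand, there are many duplicates (say, 20% of the rows), create a table containing the unique rows, drop the original table, and use the new table instead.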

create table tmp as select distinct * from words;
drop table words;
rename table tmp to words;
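A slightly more cautious variant of the same swap is sketched below. CREATE TABLE ... AS SELECT does not carry over the primary key or indexes, whereas CREATE TABLE ... LIKE does, and a single RENAME TABLE statement swaps both names atomically so there is no moment when words does not exist. The names words_new and words_old are hypothetical.

```sql
CREATE TABLE words_new LIKE words;               -- copies column definitions and indexes
INSERT INTO words_new SELECT DISTINCT * FROM words;

-- Swap in one statement; the old data stays available as words_old.
RENAME TABLE words TO words_old, words_new TO words;

-- Once the new table has been checked:
-- DROP TABLE words_old;
```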

How to optimize the speed of DELETE in MySQL