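I want to clean up duplicates in a comment table (1M rows) where a user has posted the same comment two or more times. I want to keep one instance of each duplicated comment.
Here is the query I came up with to find and group those comments: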
SELECT author, body, COUNT(*) as count
FROM db.comment
GROUP BY body
HAVING COUNT(*) > 1;
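But I don't know how to delete the duplicate rows while leaving one of them untouched. I have seen similar questions, but none of the answers worked for me. Any hints would be much appreciated.
Update: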
mysql> describe comment;
+---------+-------------+------+-----+---------+----------------+
| Field   | Type        | Null | Key | Default | Extra          |
+---------+-------------+------+-----+---------+----------------+
| id      | int(11)     | NO   | PRI | NULL    | auto_increment |
| created | datetime    | NO   |     | NULL    |                |
| author  | varchar(60) | NO   |     | NULL    |                |
| body    | longtext    | NO   |     | NULL    |                |
| post_id | int(11)     | NO   | MUL | NULL    |                |
+---------+-------------+------+-----+---------+----------------+
Answer 0 (score: 1)
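Unlike most other DBMSs, MySQL can let you select columns that are neither aggregated nor listed in GROUP BY (when the ONLY_FULL_GROUP_BY SQL mode is disabled). The value returned for such a column comes from an arbitrary row of the group, not necessarily the first one, so it is safer to pick the row to keep explicitly with MIN(id).
Do the job in two steps:
Save the ids to keep in a temporary table:
INSERT INTO temp_comment(id)
SELECT MIN(id)
FROM db.comment
GROUP BY author, body;
Delete all rows except the saved ones:
DELETE FROM db.comment WHERE id NOT IN (SELECT id FROM temp_comment);
Of course, the temp_comment table needs to exist before you run the first step.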
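The answer does not say how temp_comment should be defined. A minimal sketch, assuming a single integer id column is all that is needed (the column definition and the final check query are illustrative, not part of the original answer):

```sql
-- Assumed definition: one row per comment id that should survive.
CREATE TEMPORARY TABLE temp_comment (
    id INT NOT NULL PRIMARY KEY
);

-- ... run the INSERT and DELETE steps here ...

-- Sanity check: this should return no rows once the duplicates are gone.
SELECT author, body, COUNT(*) AS cnt
FROM db.comment
GROUP BY author, body
HAVING COUNT(*) > 1;

-- The temporary table disappears when the session ends, or drop it explicitly.
DROP TEMPORARY TABLE temp_comment;
```

It is probably worth running the whole sequence on a copy of the table first, since the DELETE cannot be undone outside a transaction.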
Answer 1 (score: 1)
Is this what you want?
SELECT * FROM comment
WHERE id NOT IN (
    -- the copy to keep: lowest id of each duplicated (author, body) pair
    SELECT MIN(id)
    FROM comment
    GROUP BY author, body
    HAVING COUNT(*) > 1
)
AND (author, body) IN (
    -- only rows that belong to a duplicated (author, body) pair
    SELECT author, body
    FROM comment
    GROUP BY author, body
    HAVING COUNT(*) > 1
);
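To delete the duplicate rows, change SELECT * to DELETE. In MySQL the subqueries then have to be wrapped in a derived table, because a DELETE cannot select from its own target table in a subquery (error 1093).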
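A minimal sketch of that DELETE, simplified to a single condition and assuming the `comment` table from the question and that the lowest id in each (author, body) group is the copy to keep. The alias `keep_ids` is just an illustrative name; the extra level of SELECT forces MySQL to materialize the id list before deleting:

```sql
DELETE FROM comment
WHERE id NOT IN (
    SELECT id FROM (
        -- one surviving id per (author, body) pair, duplicated or not
        SELECT MIN(id) AS id
        FROM comment
        GROUP BY author, body
    ) AS keep_ids
);
```

Because every group contributes its own MIN(id), comments that were never duplicated are kept as well.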
Update
To improve query performance, you could try the following:
SELECT c.* FROM comment c
INNER JOIN
(
    SELECT MIN(id) AS id, author, body
    FROM comment
    GROUP BY author, body
    HAVING COUNT(*) > 1
) AS t
ON c.id <> t.id AND c.author = t.author AND c.body = t.body
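The same join can drive the deletion through MySQL's multi-table DELETE syntax. A sketch, assuming the `comment` table from the question and that the lowest id of each duplicated group is the row to keep:

```sql
DELETE c
FROM comment AS c
INNER JOIN (
    -- one row per duplicated (author, body) pair, carrying the id to keep
    SELECT MIN(id) AS id, author, body
    FROM comment
    GROUP BY author, body
    HAVING COUNT(*) > 1
) AS t
    ON c.author = t.author
   AND c.body = t.body
   AND c.id <> t.id;   -- every other copy in the group is deleted
```

Running the SELECT version first and checking the row count is a cheap way to confirm that only the extra copies would be removed.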