Question

The table in question has 30M rows. I am running a query to find duplicate hashes. The hash column is defined as

 `hash` varchar(50) NOT NULL;

Query

SELECT Count(*)
FROM   (SELECT Count(id) `num`,
               `signature`
        FROM   `images`
        WHERE  `hash` IS NOT NULL
        GROUP  BY `hash`
        HAVING `num` > 1) AS `count_table`

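It runs for about 5-7 minutes before I abort it with Ctrl+C. I ran it again and still could not wait that long.

How can I get this result faster?

I know that grouping will make it somewhat slow, but I think 8 minutes is too much.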

Answer 1

Make sure there is an index on hash (or is it signature?).

Replace COUNT(id) with COUNT(hash) (or COUNT(signature)).
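
The two suggestions above could be combined roughly as follows. This is only a sketch: the index name idx_hash is made up for illustration, it assumes hash (not signature) is the column being deduplicated, and building the index on 30M rows may itself take a while. The WHERE clause is dropped because hash is declared NOT NULL.

```sql
-- Hypothetical index name; run once, skip if hash is already indexed.
ALTER TABLE `images` ADD INDEX `idx_hash` (`hash`);

-- Count how many distinct hash values occur more than once.
-- Only `hash` is referenced, so the index alone may be enough to answer it.
SELECT COUNT(*)
FROM   (SELECT `hash`
        FROM   `images`
        GROUP  BY `hash`
        HAVING COUNT(`hash`) > 1) AS `count_table`;
```

Running EXPLAIN on the inner SELECT should show whether the index is actually being used.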

Answer 2

If you only need the count of duplicate entries, you could try

select count(*)-count(distinct hash) from images
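
Note that this expression counts surplus rows (rows beyond the first for each hash), which is not the same number as the query in the question, which counts hash values that occur more than once. A small sketch on a made-up temporary table, demo_images, shows the difference:

```sql
-- Hypothetical demo table: 'a' appears 3 times, 'b' twice, 'c' once.
CREATE TEMPORARY TABLE demo_images (
  id   INT PRIMARY KEY,
  hash VARCHAR(50) NOT NULL
);

INSERT INTO demo_images (id, hash) VALUES
  (1, 'a'), (2, 'a'), (3, 'a'),
  (4, 'b'), (5, 'b'),
  (6, 'c');

-- Surplus rows: 6 rows - 3 distinct hashes = 3
SELECT COUNT(*) - COUNT(DISTINCT hash) AS surplus_rows
FROM   demo_images;

-- Hashes that occur more than once: 'a' and 'b' = 2
SELECT COUNT(*) AS duplicated_hashes
FROM   (SELECT hash
        FROM   demo_images
        GROUP  BY hash
        HAVING COUNT(*) > 1) AS d;
```

Which of the two numbers is wanted depends on what "duplicate count" is supposed to mean.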

Answer 3

If I understand correctly, you only want to return the duplicated rows? Try this: SELECT * FROM table_name AS t1 WHERE EXISTS (SELECT * FROM table_name AS t2 WHERE t1.hash = t2.hash AND t1.id != t2.id);
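
Written against the table from the question, the same idea might look like the sketch below. It assumes id is a unique column of images (the question does not show its definition) and selects only id and hash instead of every column.

```sql
-- Return every row whose hash also appears on at least one other row.
SELECT t1.`id`, t1.`hash`
FROM   `images` AS t1
WHERE  EXISTS (SELECT 1
               FROM   `images` AS t2
               WHERE  t2.`hash` = t1.`hash`
                 AND  t2.`id`  <> t1.`id`)
ORDER  BY t1.`hash`, t1.`id`;
```

Without an index on hash, the correlated subquery will likely be very slow on 30M rows.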

Finding duplicate column values in a 30M-record table
