Question

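What I'm trying to do is fetch all records that are almost exact duplicates of another record, where the only difference is that the duplicate does not have the extra character at the beginning of 'name'.

This is my SQL query: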

select * from tags as spaced inner join tags as not_spaced on not_spaced.name = substring(spaced.name, 2);

I have also tried:

 select * from tags as spaced where (select count(*) from tags as not_spaced where not_spaced.name = substring(spaced.name, 2)) > 0;

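What I get is... the SQL connection stops responding. Thanks!

P.S. Sorry, I didn't mention that the only field I need is name. All the other fields (if there are any) are insignificant.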

Answer 1

Try something like this:

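-- The union is wrapped in a derived table so that group by / having
-- apply to both halves, not only to the second select.
select <all fields that may be duplicated, except name>, name
from (
    select <all fields that may be duplicated, except name>, name
    from tags
    union all
    select <all fields that may be duplicated, except name>, substring(name, 2) as name
    from tags
) as t
group by <all fields that may be duplicated, including name>
having count(*) > 1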

Answer 2

If the table is very large, put an index on name and on substring(name, 2) to make it faster:

select t1.* from tags t1
inner join tags t2 on t1.name = substring(t2.name, 2)
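
As a rough sketch of what those two indexes could look like, assuming a database that supports expression indexes with this syntax (PostgreSQL does, and MySQL 8.0.13 or later accepts the double-parenthesized form). The minimal tags definition and the index names are made up for illustration:

```sql
-- Hypothetical minimal table; the real tags table may have more columns.
CREATE TABLE tags (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

-- Plain index on name (the index names here are assumptions).
CREATE INDEX idx_tags_name ON tags (name);

-- Expression index on name with its first character removed.
-- The extra parentheses mark the key as an expression rather than a column.
CREATE INDEX idx_tags_name_tail ON tags ((substring(name, 2)));
```

Whether the planner actually uses either index depends on the database and the data, so it is worth checking the plan with EXPLAIN before relying on it.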

Answer 3

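Even with indexes, your query requires every record in spaced to be checked against every record in tags.

If each table has 1,000 records, that's 1,000,000 combinations.

You might be better off creating a temporary table with just two fields, spaced.id, substring(t2.name, 2) as shortname, and then indexing the shortname field. Joining the temporary table to the indexed table should be much faster.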
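
The temporary-table idea could be sketched roughly as follows. This assumes tags has an id column and a VARCHAR name column, and that the database supports CREATE TEMPORARY TABLE ... AS SELECT; the names tags_short and idx_tags_short_shortname, and the sample rows, are invented for the example:

```sql
-- Hypothetical sample data standing in for the real tags table.
CREATE TABLE tags (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);
CREATE INDEX idx_tags_name ON tags (name);

INSERT INTO tags (id, name) VALUES
    (1, 'apple'),
    (2, ' apple'),   -- same as row 1, with one extra leading character
    (3, 'xbanana'),  -- same as row 4, with one extra leading character
    (4, 'banana'),
    (5, 'cherry');

-- Compute the shortened name once per row instead of once per comparison.
CREATE TEMPORARY TABLE tags_short AS
SELECT id, substring(name, 2) AS shortname
FROM tags;

CREATE INDEX idx_tags_short_shortname ON tags_short (shortname);

-- Each shortened name is matched against the indexed name column.
SELECT s.id   AS spaced_id,
       t.id   AS plain_id,
       t.name AS name
FROM tags_short AS s
INNER JOIN tags AS t ON t.name = s.shortname;
-- With the sample rows above this returns (2, 1, 'apple') and (3, 4, 'banana').
```

The temporary table only reflects the data at the time it was built, so it would need to be recreated if tags changes.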

Answer 4

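Without knowing which database this is, how the table is indexed, and so on, it comes down to trying different things until one of them gets optimized better...

Here is another query you can try: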

SELECT name, count(*) c FROM (
    SELECT name FROM tags
    UNION ALL
    SELECT substring(name, 2) AS name FROM tags
) AS t
GROUP BY name