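I want to find all duplicate names in a contacts table, where names count as duplicates if they sound alike. For example: Rita or Reeta, Microsoft or Microsift, Mukherjee or Mukherji.
I used the following query: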
SELECT contacts.id
FROM contacts
INNER JOIN (
SELECT first_name, last_name, count(*) AS rows
FROM contacts
WHERE deleted = 0
GROUP BY SOUNDEX(first_name), SOUNDEX(last_name)
HAVING count(rows) > 1
) AS p
WHERE contacts.deleted = 0
AND p.first_name SOUNDS LIKE contacts.first_name
AND p.last_name SOUNDS LIKE contacts.last_name
ORDER BY contacts.date_entered DESC
The query above returns the correct results, but it takes a long time when there are many records.
Answer 0: (score: 0)
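I don't know of a better (native) approach than SOUNDEX(). It is slow because it is a function: its value has to be computed for every record before any comparison can happen, and an ordinary index on the name columns cannot help with that. The way around this is to store the results directly in the table. I have no experience with these functions in MySQL, but according to the documentation it looks like you can rewrite the WHERE clause as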
[...] AND SOUNDEX(p.first_name) = SOUNDEX(contacts.first_name) [...]
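So if you have these values precomputed (and indexed!), searching for matching records should be much faster.
That said, I have a hard time making sense of your query. I don't think you need the HAVING COUNT(*) > 1 there, and even so I'm confused about how you want to group/filter the contacts.
Did you want something like this?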
SELECT c1.id as contact_id,
c2.id as similar_id
FROM contacts c1
JOIN contacts c2
ON c2.id <> c1.id
AND c2.deleted = 0
AND SOUNDEX(c2.first_name) = SOUNDEX(c1.first_name)
AND SOUNDEX(c2.last_name) = SOUNDEX(c1.last_name)
WHERE c1.deleted = 0
ORDER BY c1.date_entered DESC
Following the suggestion above, you could then change it to:
SELECT c1.id as contact_id,
c2.id as similar_id
FROM contacts c1
JOIN contacts c2
ON c2.id <> c1.id
AND c2.deleted = 0
AND c2.first_name_soundex = c1.first_name_soundex
AND c2.last_name_soundex = c1.last_name_soundex
WHERE c1.deleted = 0
ORDER BY c1.date_entered DESC
where first_name_soundex holds the result of SOUNDEX(first_name), and likewise for last_name_soundex.
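To make the precomputation step concrete, here is a minimal sketch of how the two helper columns might be added and backfilled. The column names follow the query above; the VARCHAR(64) length is an assumption (MySQL's SOUNDEX() can return strings longer than four characters, so size it to your name columns).

```sql
-- Hypothetical helper columns; names match the query above.
-- VARCHAR(64) is an assumed size, adjust it to your name columns.
ALTER TABLE contacts
  ADD COLUMN first_name_soundex VARCHAR(64) NULL,
  ADD COLUMN last_name_soundex  VARCHAR(64) NULL;

-- One-off backfill: compute each value once instead of on every search.
UPDATE contacts
SET first_name_soundex = SOUNDEX(first_name),
    last_name_soundex  = SOUNDEX(last_name);
```

These columns only help if they stay in sync with the names, so new and changed rows need the same SOUNDEX() values written by the application, a trigger, or some similar mechanism.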
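When creating the index, you will probably want a covering index over deleted, first_name_soundex, last_name_soundex.
(AFAIK MySQL does not support filtered indexes yet; otherwise you could restrict the index to deleted = 0.)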
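As a rough sketch, the index described above could be created like this. The index name idx_contacts_soundex is made up for illustration, and the statement assumes the two soundex columns already exist.

```sql
-- Hypothetical index name; column order puts the equality filter on
-- deleted first, followed by the two precomputed soundex values.
CREATE INDEX idx_contacts_soundex
  ON contacts (deleted, first_name_soundex, last_name_soundex);
```

Running EXPLAIN on the self-join afterwards should show whether MySQL actually picks this index for the lookup on c2.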
Answer 1: (score: 0)
SOUNDEX is (IMHO) of very limited practical use. An extreme example...
SELECT SOUNDEX('cholmondley');
+------------------------+
| SOUNDEX('cholmondley') |
+------------------------+
| C4534 |
+------------------------+
SELECT SOUNDEX('chumleigh');
+----------------------+
| SOUNDEX('chumleigh') |
+----------------------+
| C542 |
+----------------------+