Fetching records that sound alike

Date: 2014-04-10 09:56:09

Tags: mysql sql

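I want to find all duplicate names in my contacts table, where "duplicate" means the names sound alike. For example: Rita or Reeta, Microsoft or Microsift, Mukherjee or Mukherji.

I used the following query: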

SELECT contacts.id 
FROM contacts 
INNER JOIN (
    SELECT first_name, last_name, count(*) AS rows 
    FROM contacts 
    WHERE deleted = 0 
    GROUP BY SOUNDEX(first_name), SOUNDEX(last_name) 
    HAVING count(rows) > 1
) AS p 
WHERE contacts.deleted = 0 
AND p.first_name SOUNDS LIKE contacts.first_name 
AND p.last_name SOUNDS LIKE contacts.last_name 
ORDER BY contacts.date_entered DESC

The query above returns the correct results, but it takes a long time when the table has many records.

2 Answers:

Answer 0: (score: 0)

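I don't know of a better (native) approach than SOUNDEX(). It is slow because it is a function: the server cannot use an ordinary index on the name columns, so it has to compute the value for every record before it can compare anything. The way around this is to store the result directly in the table. I have no experience using these functions in MySQL, but according to the documentation, it seems you can rewrite the WHERE clause as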

[...] AND SOUNDEX(p.first_name) = SOUNDEX(contacts.first_name) [...]

So if you have precomputed (and indexed!) these values, searching for matching records should be much faster!

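That said, I'm having a hard time figuring out your query. I don't think you need the HAVING COUNT(*) > 1 there, and even so, I'm confused about how you want to group/filter the contacts!?

I think you want something like this: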

SELECT c1.id as contact_id, 
       c2.id as similar_id
  FROM contacts c1 
  JOIN contacts c2
    ON c2.id <> c1.id
   AND c2.deleted = 0
   AND SOUNDEX(c2.first_name) = SOUNDEX(c1.first_name)
   AND SOUNDEX(c2.last_name) = SOUNDEX(c1.last_name)
 WHERE c1.deleted = 0 
ORDER BY c1.date_entered DESC

Then, following the suggestion above, you can change it to

SELECT c1.id as contact_id, 
       c2.id as similar_id
  FROM contacts c1 
  JOIN contacts c2
    ON c2.id <> c1.id
   AND c2.deleted = 0
   AND c2.first_name_soundex = c1.first_name_soundex
   AND c2.last_name_soundex = c1.last_name_soundex
 WHERE c1.deleted = 0 
ORDER BY c1.date_entered DESC

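where first_name_soundex holds the result of SOUNDEX(first_name), and so on. When creating the index, you probably want a covering index over deleted, first_name_soundex, last_name_soundex. (AFAIK MySQL does not support filtered indexes yet; otherwise you could restrict the index to deleted = 0.)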
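A minimal sketch of how the precomputed columns and the index might be set up. The column size VARCHAR(64) and the index name idx_contacts_soundex are assumptions made for illustration, not something from the question; the size assumes names are at most 64 characters long, since MySQL's SOUNDEX() can return a string as long as its input.

```sql
-- Assumption: names are at most 64 characters, so VARCHAR(64) can hold
-- any SOUNDEX() result. Adjust to match the real column definitions.
ALTER TABLE contacts
  ADD COLUMN first_name_soundex VARCHAR(64),
  ADD COLUMN last_name_soundex  VARCHAR(64);

-- One-off backfill of the precomputed values for existing rows.
UPDATE contacts
   SET first_name_soundex = SOUNDEX(first_name),
       last_name_soundex  = SOUNDEX(last_name);

-- Composite index over the columns used by the self-join.
-- idx_contacts_soundex is a hypothetical name.
CREATE INDEX idx_contacts_soundex
    ON contacts (deleted, first_name_soundex, last_name_soundex);
```

The stored values are only useful while they stay in sync with the names, so whatever inserts or updates contacts would also need to set both soundex columns (in application code or a trigger, for instance).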

Answer 1: (score: 0)

SOUNDEX is (IMHO) of very limited practical use. An extreme example...

SELECT SOUNDEX('cholmondley');
+------------------------+
| SOUNDEX('cholmondley') |
+------------------------+
| C4534                  |
+------------------------+

SELECT SOUNDEX('chumleigh');
+----------------------+
| SOUNDEX('chumleigh') |
+----------------------+
| C542                 |
+----------------------+
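
Both spellings are conventionally pronounced "Chumley", yet the two codes differ, so a SOUNDEX-based comparison treats them as different names. As a quick check, assuming SOUNDS LIKE behaves as the MySQL manual describes it (equivalent to comparing the two SOUNDEX() values), the query below should return 0 for this pair:

```sql
-- SOUNDS LIKE is shorthand for SOUNDEX(a) = SOUNDEX(b).
-- is_match is a hypothetical alias used only for this example.
SELECT 'cholmondley' SOUNDS LIKE 'chumleigh' AS is_match;
```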