As you know, many names have more than one spelling.
I have a dataset of first and last names, and the spelling variants are causing me problems.
Here is a sample from the dataset (the table appears under the edit below):
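So I want to get everyone named "Mathew":
Matthew, Mathew, and Matthieu
or everyone whose first or last name is "Hamada":
Hamada, 7amada, 7mada
I tried replacing the digits with the corresponding letters and then using the get_close_matches function, but the result is neither accurate nor Pythonic.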
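Edit:
I think it would be better to replace all the variant spellings with the most popular one (whether it is a first or a last name). So, for this sample data:
   firstName  lastName
0        Ali    Khaled
1     Hamada      5ald
2        3ly     7mada
3     7amada     5aled
4     Sophia    Andrew
5     Sofiya     Jaxon
6   Matthieu   Jackson
7   Matthieu    Jozeph
8     Mathew     Andru
"Mathew" and "Matthieu" would be replaced with "Matthew".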
Answer 0: (score: 1)
You can do the following to group the close matches and return them as a new column:
from difflib import get_close_matches as gsm
# For each first name, join every close match found in the same column
df['Close_Matches'] = [', '.join(gsm(name, df.firstName)) for name in df.firstName]
print(df)
   firstName  lastName               Close_Matches
0        Ali    Khaled                         Ali
1     Hamada      5ald              Hamada, 7amada
2        3ly     7mada                         3ly
3     7amada     5aled              7amada, Hamada
4     Sophia    Andrew              Sophia, Sofiya
5     Sofiya     Jaxon              Sofiya, Sophia
6   Matthieu   Jackson  Matthieu, Matthieu, Mathew
7   Matthieu    Jozeph  Matthieu, Matthieu, Mathew
8     Mathew     Andru  Mathew, Matthieu, Matthieu
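The edit in the question asks for each variant to be replaced by the most popular spelling, which the close-match column does not do by itself. Here is a minimal sketch of that extra step using only the standard library. The helper name `canonical_spellings` and the tie-breaking rule (a name keeps its own spelling unless a close match is strictly more frequent) are assumptions made for illustration, not part of the answer above.

```python
from collections import Counter
from difflib import get_close_matches


def canonical_spellings(names, cutoff=0.6):
    """Map each distinct name to the most frequent spelling among its close matches.

    Hypothetical helper: `cutoff` is passed straight to get_close_matches
    (0.6 is that function's default).
    """
    counts = Counter(names)
    unique = list(counts)
    mapping = {}
    for name in unique:
        # Candidates come back sorted best match first, so the name itself leads.
        candidates = get_close_matches(name, unique, n=len(unique), cutoff=cutoff)
        # max() keeps the first candidate on ties, i.e. the name's own spelling.
        mapping[name] = max(candidates, key=lambda c: counts[c])
    return mapping


first_names = ['Ali', 'Hamada', '3ly', '7amada', 'Sophia', 'Sofiya',
               'Matthieu', 'Matthieu', 'Mathew']
mapping = canonical_spellings(first_names)
for original, canonical in mapping.items():
    print(original, '->', canonical)
```

With the sample data, "Mathew" maps to "Matthieu" because that spelling occurs twice; "Matthew" never appears in the sample, so it cannot be chosen. Pairs that occur once each, such as "Hamada" and "7amada", stay unmerged under this tie rule. Assuming the DataFrame from the question, the mapping could then be applied with `df['firstName'].map(mapping)`.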
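Answer 1: (score: 0)
The problem is that the notion of "same name, different spelling" depends on phonetics. People decide this by hearing two names pronounced and saying "hey, those sound the same". The only way a computer could know that "Matthew" and "Matthieu" are the "same name" would be some kind of text-to-speech conversion followed by audio analysis.
Since that is most likely not something you want to do, the only thing you can really look at is Hamming distance (strictly speaking edit distance, since Hamming distance is only defined for strings of equal length), together with a threshold (maybe 1 character) below which you call two names "the same". get_close_matches() works in a similar spirit, but it scores each pair as a ratio relative to the lengths of the strings (difflib's SequenceMatcher.ratio()) rather than as a raw character count. Even that will produce false positives (I can't think of one right now, but there are surely different names at a distance of 1), and you won't correctly group names like "Haley" and "Hayleigh" until you raise the threshold to 4, at which point you will get a lot of false positives.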
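To make the ratio-based scoring concrete, here is a small sketch that prints the score difflib assigns to a few pairs and whether each clears 0.6, the default cutoff of get_close_matches(). The wrapper name `similarity` is my own, and "Maria"/"Mario" is a hypothetical pair added to illustrate a likely false positive; the other names come from the question.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return difflib's ratio: 2 * matched characters / (len(a) + len(b))."""
    return SequenceMatcher(None, a, b).ratio()


CUTOFF = 0.6  # default cutoff used by difflib.get_close_matches

pairs = [
    ('Hamada', '7amada'),    # same name in the question's data
    ('Mathew', 'Matthieu'),  # same name in the question's data
    ('Ali', '3ly'),          # same name in the question's data
    ('Maria', 'Mario'),      # hypothetical pair: different names
]
for a, b in pairs:
    score = similarity(a, b)
    verdict = 'match' if score >= CUTOFF else 'no match'
    print(f'{a:8} {b:9} {score:.2f} {verdict}')
```

"Ali" and "3ly" fall below the cutoff even though the question treats them as the same name, while "Maria" and "Mario" clear it comfortably, which is the kind of trade-off described above.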
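Not to mention that a name's pronunciation need not have anything to do with its spelling. I could name my son "a" and pronounce it "Jared". How could you possibly discover that this is an alternate spelling of "Jared"? You can't, so it is not possible to determine programmatically whether two names are "the same". The real issue is that the question itself is ill-defined. You would do better to define it by saying that you want to group names that are "phonetically the same". That lets you skip contrived examples like "a", but you have only traded the problem for the need for some kind of phonetic engine, which is no small task.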
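One long-established, very rough stand-in for a phonetic engine is a phonetic key such as Soundex, which maps names that sound alike in English to the same short code. The sketch below is my own implementation of American Soundex, not something this answer proposes; the function name `soundex` is an assumption, and the algorithm is tuned to English spelling, so it may do poorly on names from other languages.

```python
# Consonant groups used by American Soundex; vowels, 'y', 'h' and 'w' get no code.
_CODES = {}
for letters, digit in (('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                       ('l', '4'), ('mn', '5'), ('r', '6')):
    for letter in letters:
        _CODES[letter] = digit


def soundex(name: str) -> str:
    """Return a 4-character Soundex key, or '' if the name has no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ''
    key = letters[0].upper()
    previous = _CODES.get(letters[0], '')
    for c in letters[1:]:
        if c in 'hw':
            continue  # 'h' and 'w' do not separate repeated codes
        code = _CODES.get(c, '')
        if code and code != previous:
            key += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (key + '000')[:4]


for name in ['Matthew', 'Mathew', 'Matthieu', 'Sophia', 'Sofiya',
             'Hamada', '7amada', 'Haley', 'Hayleigh']:
    print(name, soundex(name))
```

On the question's data this gives the three Matthew variants the same key, and likewise "Sophia" and "Sofiya". It still separates "Haley" from "Hayleigh", and because digits are dropped, "7amada" is keyed on its first letter "a" rather than "H", so it is at best a partial answer.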
tl;dr: it's not possible.
Answer 2: (score: 0)
To find the similarity between two words or sentences, you may want to use something like Edit Distance or Jaccard Distance.
Let's test with Edit Distance:
# No need to implement the distance function; you can call it from NLTK
import nltk
firstName = ['Ali', 'Hamada', '3ly', '7amada', 'Sophia', 'Sofiya', 'Matthieu', 'Matthieu', 'Mathew']
# Find similar first names using edit distance
for name in firstName:
    # Compare each name with every differently spelled name in the list
    nameToCompare = [x for x in firstName if x != name]
    for n in nameToCompare:
        print(name, n, nltk.edit_distance(name, n))
    print('***************')
# Ali Hamada 6
# Ali 3ly 2
# Ali 7amada 6
# Ali Sophia 5
# Ali Sofiya 5
# Ali Matthieu 7
# Ali Matthieu 7
# Ali Mathew 6
#***************
# Hamada Ali 6
# Hamada 3ly 6
# Hamada 7amada 1
# Hamada Sophia 5
# Hamada Sofiya 5
# Hamada Matthieu 7
# Hamada Matthieu 7
# Hamada Mathew 5
#***************
# 3ly Ali 2
# 3ly Hamada 6
# 3ly 7amada 6
# 3ly Sophia 6
# 3ly Sofiya 5
# 3ly Matthieu 8
# 3ly Matthieu 8
# 3ly Mathew 6
#***************
# 7amada Ali 6
# 7amada Hamada 1
# 7amada 3ly 6
# 7amada Sophia 5
# 7amada Sofiya 5
# 7amada Matthieu 7
# 7amada Matthieu 7
# 7amada Mathew 5
#***************
# Sophia Ali 5
# Sophia Hamada 5
# Sophia 3ly 6
# Sophia 7amada 5
# Sophia Sofiya 3
# Sophia Matthieu 6
# Sophia Matthieu 6
# Sophia Mathew 5
#***************
# Sofiya Ali 5
# Sofiya Hamada 5
# Sofiya 3ly 5
# Sofiya 7amada 5
# Sofiya Sophia 3
# Sofiya Matthieu 7
# Sofiya Matthieu 7
# Sofiya Mathew 6
#***************
# Matthieu Ali 7
# Matthieu Hamada 7
# Matthieu 3ly 8
# Matthieu 7amada 7
# Matthieu Sophia 6
# Matthieu Sofiya 7
# Matthieu Mathew 3
#***************
# Matthieu Ali 7
# Matthieu Hamada 7
# Matthieu 3ly 8
# Matthieu 7amada 7
# Matthieu Sophia 6
# Matthieu Sofiya 7
# Matthieu Mathew 3
#***************
# Mathew Ali 6
# Mathew Hamada 5
# Mathew 3ly 6
# Mathew 7amada 5
# Mathew Sophia 5
# Mathew Sofiya 6
# Mathew Matthieu 3
# Mathew Matthieu 3
#***************
A smaller number means the two names are more similar. You'll notice that it picks out similar names that are spelled differently.
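To turn those distances into groups, one option is to link any two names whose edit distance is at or below a threshold. The sketch below does that with a small pure-Python Levenshtein function standing in for nltk.edit_distance, so that it runs without NLTK. The helper names and the threshold of 3 are assumptions chosen to fit this sample; a fixed threshold is risky for short names, since any two three-letter names are within 3 edits of each other.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def group_names(names, max_distance=3):
    """Single-linkage grouping: a name joins every group holding a close name."""
    groups = []
    for name in dict.fromkeys(names):  # distinct names, original order
        linked = [g for g in groups
                  if any(edit_distance(name, other) <= max_distance for other in g)]
        groups = [g for g in groups if g not in linked]
        groups.append([n for g in linked for n in g] + [name])
    return groups


firstName = ['Ali', 'Hamada', '3ly', '7amada', 'Sophia', 'Sofiya',
             'Matthieu', 'Matthieu', 'Mathew']
for group in group_names(firstName):
    print(group)
```

With a threshold of 3, the sample first names fall into four pairs: Ali/3ly, Hamada/7amada, Sophia/Sofiya and Matthieu/Mathew.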
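Now let's apply the Jaccard distance (converted here to a similarity percentage, so larger means more similar):
for name in firstName:
    nameToCompare = [x for x in firstName if x != name]
    for n in nameToCompare:
        # Compare the sets of characters; 1 - distance gives the similarity
        print(name, n, (1-nltk.jaccard_distance(set(name), set(n)))*100)
    print('***************')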
# Ali Hamada 0.0
# Ali 3ly 19.999999999999996
# Ali 7amada 0.0
# Ali Sophia 12.5
# Ali Sofiya 12.5
# Ali Matthieu 11.111111111111116
# Ali Matthieu 11.111111111111116
# Ali Mathew 0.0
#***************
# Hamada Ali 0.0
# Hamada 3ly 0.0
# Hamada 7amada 60.0
# Hamada Sophia 11.111111111111116
# Hamada Sofiya 11.111111111111116
# Hamada Matthieu 9.999999999999998
# Hamada Matthieu 9.999999999999998
# Hamada Mathew 11.111111111111116
#***************
# 3ly Ali 19.999999999999996
# 3ly Hamada 0.0
# 3ly 7amada 0.0
# 3ly Sophia 0.0
# 3ly Sofiya 12.5
# 3ly Matthieu 0.0
# 3ly Matthieu 0.0
# 3ly Mathew 0.0
#***************
# 7amada Ali 0.0
# 7amada Hamada 60.0
# 7amada 3ly 0.0
# 7amada Sophia 11.111111111111116
# 7amada Sofiya 11.111111111111116
# 7amada Matthieu 9.999999999999998
# 7amada Matthieu 9.999999999999998
# 7amada Mathew 11.111111111111116
#***************
# Sophia Ali 12.5
# Sophia Hamada 11.111111111111116
# Sophia 3ly 0.0
# Sophia 7amada 11.111111111111116
# Sophia Sofiya 50.0
# Sophia Matthieu 30.000000000000004
# Sophia Matthieu 30.000000000000004
# Sophia Mathew 19.999999999999996
#***************
# Sofiya Ali 12.5
# Sofiya Hamada 11.111111111111116
# Sofiya 3ly 12.5
# Sofiya 7amada 11.111111111111116
# Sofiya Sophia 50.0
# Sofiya Matthieu 18.181818181818176
# Sofiya Matthieu 18.181818181818176
# Sofiya Mathew 9.090909090909093
#***************
# Matthieu Ali 11.111111111111116
# Matthieu Hamada 9.999999999999998
# Matthieu 3ly 0.0
# Matthieu 7amada 9.999999999999998
# Matthieu Sophia 30.000000000000004
# Matthieu Sofiya 18.181818181818176
# Matthieu Mathew 62.5
#***************
# Matthieu Ali 11.111111111111116
# Matthieu Hamada 9.999999999999998
# Matthieu 3ly 0.0
# Matthieu 7amada 9.999999999999998
# Matthieu Sophia 30.000000000000004
# Matthieu Sofiya 18.181818181818176
# Matthieu Mathew 62.5
#***************
# Mathew Ali 0.0
# Mathew Hamada 11.111111111111116
# Mathew 3ly 0.0
# Mathew 7amada 11.111111111111116
# Mathew Sophia 19.999999999999996
# Mathew Sofiya 9.090909090909093
# Mathew Matthieu 62.5
# Mathew Matthieu 62.5
#***************
We get good results here too!
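One caveat worth keeping in mind: applied to sets of characters, the Jaccard measure ignores both letter order and repeated letters, so anagrams score as identical. The sketch below reimplements the similarity in plain Python to show this. The function name `jaccard_similarity` is my own, and "amy"/"may" is a hypothetical pair added for illustration.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Size of the shared character set divided by the size of the combined set."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(union)


print(jaccard_similarity('Mathew', 'Matthieu'))  # 5 shared of 8 distinct characters
print(jaccard_similarity('Hamada', '7amada'))    # 3 shared of 5 distinct characters
print(jaccard_similarity('amy', 'may'))          # anagrams share every character
```

Comparing sets of character bigrams instead of single characters is one common way to keep some ordering information, at the cost of being stricter on short names.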
Hope this helps.