How to return names with different spellings from a DataFrame

Time: 2019-04-15 23:41:40

Tags: python-3.x pandas dataframe

As you know, many names have more than one spelling.

I have a dataset containing first names and last names, but I am running into problems with spelling.

Here is a sample from the dataset:

  firstName lastName
0       Ali   Khaled
1    Hamada     5ald
2       3ly    7mada
3    7amada    5aled
4    Sophia   Andrew
5    Sofiya    Jaxon
6  Matthieu  Jackson
7  Matthieu   Jozeph
8    Mathew    Andru

So I want to get everyone named "Mathew", however it is spelled:
   Matthew, Mathew and Matthieu

Or everyone whose first or last name is "Hamada":
   Hamada, 7amada, 7mada

I tried replacing the digits with the corresponding letters and then using the get_close_matches function, but it is neither accurate nor Pythonic.

EDIT
I think it would be better to replace all of the variant spellings with the popular spelling (whether in the first or the last name). So, for example, "Mathew" and "Matthieu" would be replaced with "Matthew".
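For reference, the attempt described in the question (digits replaced with letters, then get_close_matches) might look roughly like the sketch below. The digit mapping (7 to "h", 3 to "a", 5 to "kh") is an assumption inferred from the sample rows, and the helper names `normalize` and `find_variants` are hypothetical.

```python
from difflib import get_close_matches

# Assumed mapping, inferred from sample values such as 7amada, 3ly and 5aled.
DIGIT_TO_LETTERS = {"7": "h", "3": "a", "5": "kh"}


def normalize(name):
    """Lower-case a name and replace the digits with their assumed letters."""
    lowered = name.strip().lower()
    return "".join(DIGIT_TO_LETTERS.get(ch, ch) for ch in lowered)


def find_variants(target, names, cutoff=0.6):
    """Return the names whose normalized form is close to the normalized target."""
    key = normalize(target)
    return [
        name
        for name in names
        if get_close_matches(normalize(name), [key], n=1, cutoff=cutoff)
    ]


first_names = ["Ali", "Hamada", "3ly", "7amada", "Sophia",
               "Sofiya", "Matthieu", "Matthieu", "Mathew"]

print(find_variants("Hamada", first_names + ["7mada"]))
print(find_variants("Matthew", first_names))
```

The `cutoff` value is the default used by get_close_matches; raising it makes the match stricter, lowering it lets in more unrelated names.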

3 Answers:

Answer 0: (score: 1)

You can do the following to group the close matches and return them as a new column:


from difflib import get_close_matches as gsm

df['Close_Matches'] = [', '.join(gsm(name, df.firstName)) for name in df.firstName]

print(df)

  firstName lastName               Close_Matches
0       Ali   Khaled                         Ali
1    Hamada     5ald              Hamada, 7amada
2       3ly    7mada                         3ly
3    7amada    5aled              7amada, Hamada
4    Sophia   Andrew              Sophia, Sofiya
5    Sofiya    Jaxon              Sofiya, Sophia
6  Matthieu  Jackson  Matthieu, Matthieu, Mathew
7  Matthieu   Jozeph  Matthieu, Matthieu, Mathew
8    Mathew    Andru  Mathew, Matthieu, Matthieu
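To get from these groups to the replacement the question's edit asks for, one option is to map every name to the most frequent spelling among its close matches. This is only a sketch: the function name `popular_spelling_map` is hypothetical, the tie-break (prefer all-letter spellings, then alphabetical order) is an arbitrary assumption, and plain lists are used so the example runs without pandas.

```python
from collections import Counter
from difflib import get_close_matches


def popular_spelling_map(names, cutoff=0.6):
    """Map each distinct name to the most frequent spelling among its close matches."""
    counts = Counter(names)
    unique = list(counts)
    mapping = {}
    for name in unique:
        # The name itself always matches with ratio 1.0, so the group is never empty.
        group = get_close_matches(name, unique, n=len(unique), cutoff=cutoff)
        # Most frequent first; ties go to all-letter spellings, then alphabetical order.
        mapping[name] = min(group, key=lambda g: (-counts[g], not g.isalpha(), g))
    return mapping


first_names = ["Ali", "Hamada", "3ly", "7amada", "Sophia",
               "Sofiya", "Matthieu", "Matthieu", "Mathew"]

mapping = popular_spelling_map(first_names)
cleaned = [mapping[name] for name in first_names]
print(cleaned)
```

With the DataFrame from the question, the same mapping could be applied with df['firstName'].map(mapping). Note that "Matthew" itself never appears in the sample, so the most frequent spelling there is "Matthieu", and that "Ali" and "3ly" stay separate because they are not close matches at the default cutoff.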

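Answer 1: (score: 0)

The problem is that the concept of "same name, spelled differently" depends on phonetics. People decide this by hearing the two names pronounced and saying "hey, those sound the same". The only way a computer could know that "Matthew" and "Matthieu" are the "same name" is by running some kind of text-to-speech followed by audio analysis.

Since that is most likely not something you want to do, the only thing you can really look at is Hamming distance (strictly, edit distance, since Hamming distance is only defined for strings of equal length), with some threshold (maybe 1 character) below which you call two names "the same". That is most likely what get_close_matches() does, except that it scores the match as a ratio relative to word length. Even then you will get false positives (I can't think of one right now, but there are certainly different names at a distance of 1), and you won't correctly group names like "Haley" and "Hayleigh" until you raise that threshold to 4, at which point you will have a lot of false positives.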
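The threshold trade-off described here can be checked with a small edit-distance (Levenshtein) function. The sketch below is a standard dynamic-programming version written for illustration; the name `edit_distance` and the "Mary"/"Gary" pair are not from the answer.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(edit_distance("Haley", "Hayleigh"))  # same name, far apart
print(edit_distance("Mary", "Gary"))       # different names, very close
```

A threshold of 1 would merge "Mary" with "Gary" while still keeping "Haley" and "Hayleigh" apart, which is the problem the answer points out.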

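Not to mention that a name doesn't have to be pronounced the way it is spelled. I could name my son "a" and pronounce it "Jared". How could you possibly detect that this is an alternative spelling of "Jared"? You can't, so it is impossible to determine programmatically whether two names are "the same". The issue is that the question itself is poorly defined. You would do better to define it by saying that you want to group names that are "phonetically identical". That lets you skip contrived examples like "a", but you have only traded the problem for the need for some kind of phonetic engine, which is not trivial.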
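A much cruder stand-in for a phonetic engine is a phonetic key such as Soundex, which maps similar-sounding consonants to the same digit. It is not what this answer proposes and it is tuned to English spelling; the sketch below is a plain implementation of the usual American Soundex rules, with the function name `soundex` chosen for illustration. Characters that are not letters (such as the digits in "7amada") are simply dropped, which is an assumption of this sketch.

```python
def soundex(name):
    """Return the four-character American Soundex key of a name."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for letter in letters:
            codes[letter] = digit

    letters_only = [ch for ch in name.lower() if ch.isalpha()]
    if not letters_only:
        return ""

    key = letters_only[0].upper()
    previous = codes.get(letters_only[0], "")
    for ch in letters_only[1:]:
        if ch in "hw":
            continue  # h and w do not separate consonants with the same code
        code = codes.get(ch, "")
        if code and code != previous:
            key += code
        previous = code  # vowels reset the previous code
    return (key + "000")[:4]


for name in ["Matthew", "Mathew", "Matthieu", "Sophia", "Sofiya",
             "Haley", "Hayleigh"]:
    print(name, soundex(name))
```

The three "Matthew" spellings and the two "Sophia" spellings share a key, but "Haley" and "Hayleigh" do not, which illustrates why the answer calls the phonetic route non-trivial.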

tl;dr: not possible

Answer 2: (score: 0)

To measure the similarity between two words or sentences, you may want to use something like Edit Distance or Jaccard Distance.

Let's test it using Edit Distance:

firstName = ['Ali', 'Hamada', '3ly', '7amada', 'Sophia', 'Sofiya', 'Matthieu', 'Matthieu', 'Mathew']

# No need to implement the distance function yourself; NLTK provides it

import nltk

# Find similar first names using edit distance
for name in firstName:
    nameToCompare = [x for x in firstName if x != name]
    for n in nameToCompare:
        print(name, n, nltk.edit_distance(name, n))
    print('***************')

# Ali Hamada 6
# Ali 3ly 2
# Ali 7amada 6
# Ali Sophia 5
# Ali Sofiya 5
# Ali Matthieu 7
# Ali Matthieu 7
# Ali Mathew 6
#***************
# Hamada Ali 6
# Hamada 3ly 6
# Hamada 7amada 1
# Hamada Sophia 5
# Hamada Sofiya 5
# Hamada Matthieu 7
# Hamada Matthieu 7
# Hamada Mathew 5
#***************
# 3ly Ali 2
# 3ly Hamada 6
# 3ly 7amada 6
# 3ly Sophia 6
# 3ly Sofiya 5
# 3ly Matthieu 8
# 3ly Matthieu 8
# 3ly Mathew 6
#***************
# 7amada Ali 6
# 7amada Hamada 1
# 7amada 3ly 6
# 7amada Sophia 5
# 7amada Sofiya 5
# 7amada Matthieu 7
# 7amada Matthieu 7
# 7amada Mathew 5
#***************
# Sophia Ali 5
# Sophia Hamada 5
# Sophia 3ly 6
# Sophia 7amada 5
# Sophia Sofiya 3
# Sophia Matthieu 6
# Sophia Matthieu 6
# Sophia Mathew 5
#***************
# Sofiya Ali 5
# Sofiya Hamada 5
# Sofiya 3ly 5
# Sofiya 7amada 5
# Sofiya Sophia 3
# Sofiya Matthieu 7
# Sofiya Matthieu 7
# Sofiya Mathew 6
#***************
# Matthieu Ali 7
# Matthieu Hamada 7
# Matthieu 3ly 8
# Matthieu 7amada 7
# Matthieu Sophia 6
# Matthieu Sofiya 7
# Matthieu Mathew 3
#***************
# Matthieu Ali 7
# Matthieu Hamada 7
# Matthieu 3ly 8
# Matthieu 7amada 7
# Matthieu Sophia 6
# Matthieu Sofiya 7
# Matthieu Mathew 3
#***************
# Mathew Ali 6
# Mathew Hamada 5
# Mathew 3ly 6
# Mathew 7amada 5
# Mathew Sophia 5
# Mathew Sofiya 6
# Mathew Matthieu 3
# Mathew Matthieu 3
#***************

A smaller number means the two names are more similar. You will notice that it picks out similar names with different spellings.
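Raw distances favour short names: "Ali" and "3ly" are only 2 edits apart, fewer than "Matthieu" and "Mathew" at 3, even though two of their three characters differ. One possible adjustment, shown here as an assumption rather than part of the answer, is to divide each distance by the length of the longer name. The distances below are copied from the output above, and the helper name `normalized` is hypothetical.

```python
# Edit distances copied from the printed output above.
distances = {
    ("Hamada", "7amada"): 1,
    ("Ali", "3ly"): 2,
    ("Sophia", "Sofiya"): 3,
    ("Matthieu", "Mathew"): 3,
    ("Hamada", "Mathew"): 5,
}


def normalized(pair, distance):
    """Scale a distance to the range 0..1 using the longer name's length."""
    return distance / max(len(pair[0]), len(pair[1]))


# Most similar pairs first.
ranked = sorted(distances.items(), key=lambda item: normalized(*item))
for pair, distance in ranked:
    print(pair, distance, round(normalized(pair, distance), 3))
```

After normalizing, "Matthieu"/"Mathew" ranks as more similar than "Ali"/"3ly", so a single ratio threshold behaves more evenly across name lengths.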

Now let's apply Jaccard Distance:

for name in firstName:
    nameToCompare = [x for x in firstName if x != name]
    for n in nameToCompare:
        print(name, n, (1-nltk.jaccard_distance(set(name), set(n)))*100)
    print('***************')

# Ali Hamada 0.0
# Ali 3ly 19.999999999999996
# Ali 7amada 0.0
# Ali Sophia 12.5
# Ali Sofiya 12.5
# Ali Matthieu 11.111111111111116
# Ali Matthieu 11.111111111111116
# Ali Mathew 0.0
#***************
# Hamada Ali 0.0
# Hamada 3ly 0.0
# Hamada 7amada 60.0
# Hamada Sophia 11.111111111111116
# Hamada Sofiya 11.111111111111116
# Hamada Matthieu 9.999999999999998
# Hamada Matthieu 9.999999999999998
# Hamada Mathew 11.111111111111116
#***************
# 3ly Ali 19.999999999999996
# 3ly Hamada 0.0
# 3ly 7amada 0.0
# 3ly Sophia 0.0
# 3ly Sofiya 12.5
# 3ly Matthieu 0.0
# 3ly Matthieu 0.0
# 3ly Mathew 0.0
#***************
# 7amada Ali 0.0
# 7amada Hamada 60.0
# 7amada 3ly 0.0
# 7amada Sophia 11.111111111111116
# 7amada Sofiya 11.111111111111116
# 7amada Matthieu 9.999999999999998
# 7amada Matthieu 9.999999999999998
# 7amada Mathew 11.111111111111116
#***************
# Sophia Ali 12.5
# Sophia Hamada 11.111111111111116
# Sophia 3ly 0.0
# Sophia 7amada 11.111111111111116
# Sophia Sofiya 50.0
# Sophia Matthieu 30.000000000000004
# Sophia Matthieu 30.000000000000004
# Sophia Mathew 19.999999999999996
#***************
# Sofiya Ali 12.5
# Sofiya Hamada 11.111111111111116
# Sofiya 3ly 12.5
# Sofiya 7amada 11.111111111111116
# Sofiya Sophia 50.0
# Sofiya Matthieu 18.181818181818176
# Sofiya Matthieu 18.181818181818176
# Sofiya Mathew 9.090909090909093
#***************
# Matthieu Ali 11.111111111111116
# Matthieu Hamada 9.999999999999998
# Matthieu 3ly 0.0
# Matthieu 7amada 9.999999999999998
# Matthieu Sophia 30.000000000000004
# Matthieu Sofiya 18.181818181818176
# Matthieu Mathew 62.5
#***************
# Matthieu Ali 11.111111111111116
# Matthieu Hamada 9.999999999999998
# Matthieu 3ly 0.0
# Matthieu 7amada 9.999999999999998
# Matthieu Sophia 30.000000000000004
# Matthieu Sofiya 18.181818181818176
# Matthieu Mathew 62.5
#***************
# Mathew Ali 0.0
# Mathew Hamada 11.111111111111116
# Mathew 3ly 0.0
# Mathew 7amada 11.111111111111116
# Mathew Sophia 19.999999999999996
# Mathew Sofiya 9.090909090909093
# Mathew Matthieu 62.5
# Mathew Matthieu 62.5
#***************

We get good results here as well!
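One caveat with the set(name) comparison used above: a set of characters ignores order and repetition, so anagrams score as identical. A possible variation, offered as a sketch and not as part of the answer, is to compare sets of character bigrams instead. The helpers `jaccard_similarity` and `bigrams` are hypothetical, written in plain Python so the example does not depend on NLTK.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def bigrams(word):
    """Set of adjacent character pairs, case-insensitive."""
    word = word.lower()
    return {word[i:i + 2] for i in range(len(word) - 1)}


# Character sets cannot tell anagrams apart...
print(jaccard_similarity("amy", "may"))
# ...while bigrams keep some information about character order.
print(jaccard_similarity(bigrams("amy"), bigrams("may")))
print(jaccard_similarity(bigrams("Hamada"), bigrams("7amada")))
```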

Hope this helps.