How do I merge on approximate strings?

Time: 2019-02-02 14:51:33

Tags: python pandas

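I want to merge two dataframes on approximate country names, but the merge fails with the following error:

TypeError: 'NoneType' object is not callable

See the illustrative code below: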

import pandas as pd

cl = {'Country': ["Brazil", "US", "Russia"], 'BL?': ['No', 'No', 'Yes']}
clist = pd.DataFrame.from_dict(cl)

cd = {'Country' : ["Braizl", "us", "Rusia"]}
cdata  = pd.DataFrame.from_dict(cd)

clist = clist.sort_values('Country')
cdata = cdata.sort_values('Country')


cdata = pd.merge_asof(cdata,clist,on='Country')  

The expected result is the two dataframes merged, so that cdata gains a 'BL?' column with Yes/No values.

Thanks in advance!

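2 answers:

Answer 0: (score: 3)

This should get you close, but it won't be 100% accurate. You can use fuzzywuzzy, which uses Levenshtein distance to measure the difference between two strings: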

from fuzzywuzzy import process

# create a choice list
choices = clist['Country'].values.tolist()

# apply fuzzywuzzy to each row using lambda expression
cdata['Close Country'] = cdata['Country'].apply(lambda x: process.extractOne(x, choices)[0])

# merge
cdata.merge(clist, left_on='Close Country', right_on='Country')


  Country_x Close Country Country_y  BL?
0    Braizl        Brazil    Brazil   No
1     Rusia        Russia    Russia  Yes
2        us            US        US   No
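
For intuition about the Levenshtein distance mentioned above, here is a minimal pure-Python sketch of the classic dynamic-programming computation. The function name `levenshtein` is only illustrative and is not part of fuzzywuzzy; fuzzywuzzy also lowercases its inputs by default and reports a 0-100 similarity score rather than this raw edit count.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Braizl", "Brazil"))  # swapped letters cost two edits
print(levenshtein("Rusia", "Russia"))   # one missing letter costs one edit
print(levenshtein("us", "US"))          # case-sensitive, so two substitutions
```

The case-sensitive result for "us" versus "US" is why normalizing case before comparing matters for data like this.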

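You can also return the percent match and keep only the rows whose score exceeds a threshold n, for example only matches above 85%.

Adding the percent match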

from fuzzywuzzy import process

# create a choice list
choices = clist['Country'].values.tolist()

# apply fuzzywuzzy to each row using lambda expression
cdata['Close Country'] = cdata['Country'].apply(lambda x: process.extractOne(x, choices))

# split the (match, score) tuple into two columns with apply
cdata[['Close Country', 'Percent Match']] = cdata['Close Country'].apply(pd.Series)

# merge
cdata.merge(clist, left_on='Close Country', right_on='Country')

  Country_x Close Country  Percent Match Country_y  BL?
0    Braizl        Brazil             83    Brazil   No
1     Rusia        Russia             91    Russia  Yes
2        us            US            100        US   No

You can use boolean indexing before the merge to drop the weak matches, and then merge:

cdata[['Close Country', 'Percent Match']] = cdata['Close Country'].apply(pd.Series)
cdata = cdata[cdata['Percent Match']>85]

Or you can filter after the merge:

merge = cdata.merge(clist, left_on='Close Country', right_on='Country')
merge[merge['Percent Match'] > 85]

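fuzzywuzzy's process.extractOne returns the percent match alongside the matched string, as a (match, score) tuple. In the first example I discarded the score by taking the first element of the tuple: process.extractOne(x, choices)[0]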
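
If installing fuzzywuzzy is not an option, a similar lookup can be sketched with the standard library's difflib instead. This swaps in a different technique from the answer above: difflib scores similarity with its own ratio (0.0 to 1.0) rather than fuzzywuzzy's 0-100 score, so thresholds are not interchangeable. The helper name `closest_country` and the 0.8 cutoff are assumptions chosen for this example.

```python
from difflib import get_close_matches

choices = ["Brazil", "US", "Russia"]

# difflib compares case-sensitively, so match on lowercased names and
# map the winner back to its canonical spelling afterwards.
canonical = {name.lower(): name for name in choices}


def closest_country(name, cutoff=0.8):
    """Return the best canonical match for name, or None if nothing is close enough."""
    matches = get_close_matches(name.lower(), list(canonical), n=1, cutoff=cutoff)
    return canonical[matches[0]] if matches else None


for raw in ["Braizl", "us", "Rusia", "Germany"]:
    print(raw, "->", closest_country(raw))
```

With pandas, the helper could be applied as cdata['Close Country'] = cdata['Country'].apply(closest_country) before the merge; rows with no sufficiently close match get None and simply drop out of an inner merge.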

Answer 1: (score: 1)

Given your example, I came up with a solution. It isn't very Pythonic, but it works! (This assumes clist contains a matching country name for every misspelled country in cdata.)

import editdistance

def get_closest(x, column):
    # column is unused; the lookup reads the global clist directly
    tmp = 1000
    for i2, r2 in clist.iterrows():
        # edit distance between the misspelled name and this candidate
        levenshtein = editdistance.eval(x, r2['Country'])
        if levenshtein <= tmp:
            tmp = levenshtein
            res = r2

    return res['BL?']

cdata['BL'] = cdata['Country'].apply(lambda x: get_closest(x, clist))

Output:

   Country   BL
0  Braizl   No
1      us   No
2   Rusia  Yes

I'm using the editdistance library to compute the Levenshtein distance. You can install it with pip:

pip install editdistance