Fuzzy ratio between two columns, keeping the best match only when another column matches 100%

Time: 2016-05-03 16:21:05

Tags: python pandas fuzzywuzzy

I have two dataframes, df1 and df2 (originally shown as an image). Both have 'Account Name' and 'Billing Country' columns.

Matcher = df2['Account Name']

match = if df1['Billing Country'] == df2['Billing Country'] (process.extractOne(df1['Account Name'], Matcher))

The code above is not valid, but I want to fuzzy-match the account names only when the countries match.

2 answers:

Answer 0: (score: 1)

Here is my suggestion. First, do a full Cartesian join of the two dataframes:

df1.loc[:, 'MergeKey'] = 1 #create a mergekey
df2.loc[:, 'MergeKey'] = 1 #it is the same for both so that when you merge you get the cartesian product
#merge them to get the cartesian product (all possible combos)
merged = df1.merge(df2, on = 'MergeKey', suffixes = ['_1', '_2'])
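On pandas 1.2 or later, the same Cartesian product can probably be built without a helper key by passing how="cross" to merge. A minimal sketch, assuming two small made-up frames (the real df1 and df2 are not shown in the post):

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df1 = pd.DataFrame({
    "Account Name": ["Acme Corp", "Globex"],
    "Billing Country": ["US", "DE"],
})
df2 = pd.DataFrame({
    "Account Name": ["Acme Corporation", "Globex GmbH", "Initech"],
    "Billing Country": ["US", "DE", "US"],
})

# how="cross" pairs every row of df1 with every row of df2,
# so no temporary MergeKey column is needed.
merged = df1.merge(df2, how="cross", suffixes=("_1", "_2"))
print(merged.shape)  # (6, 4): 2 x 3 row pairs, two suffixed columns per side
```

The MergeKey approach above gives the same row pairs, plus the extra MergeKey column, and also works on older pandas versions.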

Then compute the fuzzy ratio for each combination:

from fuzzywuzzy import fuzz

def fuzzratio(row):
    try:  # guard against non-string values such as NaN
        return fuzz.ratio(row['Billing Country_1'], row['Billing Country_2'])
    except Exception:
        return 0  # you'll want to experiment without the try/except too

# create the Ratio column (0-100) by applying the function to every row
merged.loc[:, 'Ratio'] = merged.apply(fuzzratio, axis=1)

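You should now have a df containing the ratio for every possible combination of df1['Billing Country'] and df2['Billing Country']. From there, just filter to keep the rows whose ratio is 100 (fuzz.ratio returns an integer from 0 to 100):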

result = merged[merged.Ratio == 100]
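The question also asks for the account-name match itself, so here is one way the remaining step might look: score the account names within the same-country pairs and keep the best candidate per df1 account. This is only a sketch under assumptions: the sample data is made up, plain equality on the country columns stands in for the Ratio filter, name_ratio is a hypothetical helper that falls back to difflib when fuzzywuzzy is not installed, and df1 account names are assumed to be unique.

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    name_ratio = fuzz.ratio
except ImportError:
    from difflib import SequenceMatcher

    def name_ratio(a, b):
        # Stand-in on the same 0-100 scale, standard library only.
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))

# Hypothetical sample data; only the column names come from the question.
df1 = pd.DataFrame({
    "Account Name": ["Acme Corp", "Globex"],
    "Billing Country": ["US", "DE"],
})
df2 = pd.DataFrame({
    "Account Name": ["Acme Corporation", "Globex GmbH", "Initech"],
    "Billing Country": ["US", "DE", "US"],
})

# Cartesian product via a constant key, as in the answer.
merged = df1.assign(MergeKey=1).merge(
    df2.assign(MergeKey=1), on="MergeKey", suffixes=("_1", "_2")
)

# Keep only pairs from the same country (assumption: exact equality).
same_country = merged[
    merged["Billing Country_1"] == merged["Billing Country_2"]
].copy()

# Score the account names for each remaining pair.
same_country["Name Ratio"] = [
    name_ratio(a, b)
    for a, b in zip(same_country["Account Name_1"], same_country["Account Name_2"])
]

# Best-scoring df2 name for each df1 name.
best = same_country.loc[
    same_country.groupby("Account Name_1")["Name Ratio"].idxmax()
]
print(best[["Account Name_1", "Account Name_2", "Name Ratio"]])
```

Note that the Cartesian product grows as len(df1) * len(df2), so this is likely practical only for moderately sized frames.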

Answer 1: (score: 0)

I figured it out in a slightly different way.

First I merged on the country column:
merged_file = pd.merge(df2, df1, on='Billing Country', how = 'left')

Once I had all the possible matches, I applied fuzzywuzzy:

`from fuzzywuzzy import process`

`Reference_data = df2['Account Name']`

`Result = df1['Account Name'].apply(lambda name: process.extractOne(name, Reference_data))`

The line above gave me the closest match for every value I wanted to look up. Later I added one more line to compute the ratio:

merged_file['ratio'] = merged_file.apply(lambda row: fuzz.ratio(row['Account Name_x'], row['Account Name_y']), axis=1)
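Another way to get what the question describes is to restrict the candidate list by country before calling process.extractOne, so no merge is needed. The following is a sketch under assumptions, not the answerer's code: the sample data is made up, best_match is a hypothetical wrapper, and a difflib-based stand-in is used when fuzzywuzzy is not installed (its scores will differ from extractOne's default scorer).

```python
import pandas as pd

try:
    from fuzzywuzzy import process

    def best_match(name, choices):
        # For a list of choices, extractOne returns (choice, score) or None.
        return process.extractOne(name, choices)
except ImportError:
    from difflib import SequenceMatcher

    def best_match(name, choices):
        # Stand-in scorer on a 0-100 scale, standard library only.
        scored = [
            (c, int(round(100 * SequenceMatcher(None, name, c).ratio())))
            for c in choices
        ]
        return max(scored, key=lambda pair: pair[1]) if scored else None

# Hypothetical sample data; only the column names come from the question.
df1 = pd.DataFrame({
    "Account Name": ["Acme Corp", "Globex", "Umbrella"],
    "Billing Country": ["US", "DE", "FR"],
})
df2 = pd.DataFrame({
    "Account Name": ["Acme Corporation", "Globex GmbH", "Initech"],
    "Billing Country": ["US", "DE", "US"],
})

# Candidate account names from df2, keyed by country.
choices_by_country = (
    df2.groupby("Billing Country")["Account Name"].apply(list).to_dict()
)

def match_in_country(name, country):
    choices = choices_by_country.get(country, [])
    # Skip the fuzzy lookup entirely when no df2 row shares the country.
    found = best_match(name, choices) if choices else None
    return found if found else (None, 0)

matches = [
    match_in_country(name, country)
    for name, country in zip(df1["Account Name"], df1["Billing Country"])
]
result = df1.copy()
result["Best Match"] = [m[0] for m in matches]
result["Ratio"] = [m[1] for m in matches]
print(result)
```

Rows whose country has no counterpart in df2 (the made-up "Umbrella" row here) end up with no match and a ratio of 0.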