I want to merge two DataFrames containing restaurant information.
import pandas as pd

df1 = pd.DataFrame({'Restaurant_Name': ['Apple', 'Banana', 'Orange', 'apple', 'apple1'],
                    'Postal Code': [12345, 12345, 54321, 54321, 1111]})
df2 = pd.DataFrame({'Restaurant_Name': ['apple', 'apple', 'Banana'],
                    'Postal Code': [12345, 54321, 12345],
                    'Phone': [100, 200, 300]})
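I tried fuzzy-matching the restaurant names and then matching on postal code, but the results were not very accurate. I also tried concatenating each restaurant name with its postal code in both DataFrames and fuzzy-matching the concatenated strings, but I don't think that is the best approach.
Is there any way to match the two DataFrames with 100% accuracy?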
Answer 0 (score: 2)
Take a look at difflib.get_close_matches().
I tried it with your sample DataFrames. Does this help?
import difflib
import pandas as pd

df1 = pd.DataFrame({'Restaurant_Name': ['Apple', 'Banana', 'Orange', 'apple', 'apple1'],
                    'Postal Code': [12345, 12345, 54321, 54321, 1111]})
df2 = pd.DataFrame({'Restaurant_Name': ['apple', 'apple', 'Banana'],
                    'Postal Code': [12345, 54321, 12345],
                    'Phone': [100, 200, 300]})

# Build a composite key from the restaurant name and the postal code.
df1['key'] = df1['Restaurant_Name'] + df1['Postal Code'].astype(str)
df2['key'] = df2['Restaurant_Name'] + df2['Postal Code'].astype(str)

# Replace each df2 key with its closest df1 key (default cutoff is 0.6).
# Note: [0] raises IndexError if no df1 key is close enough.
df2['key'] = df2['key'].apply(lambda x: difflib.get_close_matches(x, df1['key'])[0])

df1.merge(df2, on='key', how='outer')[['Restaurant_Name_x', 'Restaurant_Name_y', 'Postal Code_x', 'Phone']]
Output:
   Restaurant_Name_x  Restaurant_Name_y  Postal Code_x  Phone
0              Apple              apple          12345  100.0
1             Banana             Banana          12345  300.0
2             Orange                NaN          54321    NaN
3              apple              apple          54321  200.0
4             apple1                NaN           1111    NaN
As you described, I concatenated the restaurant name with the postal code to get a unique combination.
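One caveat with this approach: get_close_matches() returns an empty list when no candidate scores at or above its cutoff (0.6 by default), so indexing the result with [0] raises IndexError for a key that has no close match. Below is a minimal sketch of a guarded lookup using only the standard library. The helper name closest_key is hypothetical (it is not part of difflib), and lowercasing before comparison is an assumption about what should count as the same restaurant.

```python
import difflib

def closest_key(key, candidates, cutoff=0.6):
    """Return the candidate closest to `key`, or None if none reaches `cutoff`."""
    # Compare case-insensitively, but return the candidate in its original form.
    # If two candidates differ only by case, the first one seen is kept.
    lowered = {}
    for candidate in candidates:
        lowered.setdefault(candidate.lower(), candidate)
    matches = difflib.get_close_matches(key.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

# Composite keys as built from df1 in the answer above.
df1_keys = ['Apple12345', 'Banana12345', 'Orange54321', 'apple54321', 'apple11111']

print(closest_key('apple12345', df1_keys))  # a key that differs only by case
print(closest_key('Pear99999', df1_keys))   # a key with no close match
```

It could then be applied with something like df2['key'].apply(lambda k: closest_key(k, list(df1['key']))), which leaves None for unmatched rows instead of raising.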
Answer 1 (score: 0)
One option is to use a fuzzy string matching module such as fuzzywuzzy.
Install the required libraries:
pip install fuzzywuzzy
pip install python-Levenshtein
Now find the name matches as follows:
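import pandas as pd
from fuzzywuzzy import fuzz

match_level = 90  # minimum similarity score (0-100) required to accept a match

def find_details(row):
    # Only consider df2 rows with the same postal code as this df1 row.
    sub_df = df2[df2['Postal Code'] == row['Postal Code']].copy()
    # Score each candidate name against this row's name.
    sub_df['match'] = sub_df['Restaurant_Name'].apply(lambda x: fuzz.token_sort_ratio(row['Restaurant_Name'], x))
    # Keep candidates at or above the threshold, best match first.
    sub_df = sub_df[sub_df['match'] >= match_level].sort_values(['match'], ascending=[False])
    phone = ''
    if sub_df.shape[0] > 0:
        phone = sub_df['Phone'].values[0]
    ret = {
        'phone': phone
    }
    return pd.Series(ret)

# df1 and df2 are the DataFrames defined in the question.
df1.merge(df1.apply(lambda row: find_details(row), axis=1), left_index=True, right_index=True)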
This gives you the following output:
   Restaurant_Name  Postal Code  phone
0            Apple        12345    100
1           Banana        12345    300
2           Orange        54321
3            apple        54321    200
4           apple1         1111
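If you would rather not install extra packages, the same idea (restrict candidates to the same postal code first, then fuzzy-match the names) can be sketched with the standard library. This is only a rough illustration: best_phone is a hypothetical helper, the rows are plain tuples instead of DataFrame rows, and difflib.SequenceMatcher is swapped in for fuzz.token_sort_ratio, so the scores will not be identical to fuzzywuzzy's.

```python
from difflib import SequenceMatcher

def best_phone(name, postal_code, records, min_score=90):
    """Return the phone of the best-scoring record in the same postal code, or None."""
    best_score, best = -1, None
    for rec_name, rec_code, phone in records:
        if rec_code != postal_code:
            continue  # only compare restaurants that share the postal code
        # Case-insensitive similarity on a 0-100 scale (stand-in for a fuzzywuzzy score).
        score = 100 * SequenceMatcher(None, name.lower(), rec_name.lower()).ratio()
        if score >= min_score and score > best_score:
            best_score, best = score, phone
    return best

# (Restaurant_Name, Postal Code, Phone) rows taken from df2 in the question.
df2_rows = [('apple', 12345, 100), ('apple', 54321, 200), ('Banana', 12345, 300)]

print(best_phone('Apple', 12345, df2_rows))   # same name apart from case, same postal code
print(best_phone('Orange', 54321, df2_rows))  # postal code matches, name does not
print(best_phone('apple1', 1111, df2_rows))   # no record with this postal code
```

Filtering on postal code before scoring keeps 'apple' at 12345 and 'apple' at 54321 from being confused with each other, which is the main weakness of fuzzy-matching a concatenated name-plus-code string.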