I have two sample dataframes as follows:
import pandas as pd

df1 = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Shiela'},
                    'Degree': {0: 'Masters', 1: 'Graduate', 2: 'Graduate'},
                    'Age': {0: 27, 1: 23, 2: 21}})
df2 = pd.DataFrame({'Name': {0: 'John S.', 1: 'Bob K.', 2: 'Frank'},
                    'Degree': {0: 'Master', 1: 'Graduated', 2: 'Graduated'},
                    'GPA': {0: 3, 1: 3.5, 2: 4}})
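I want to merge them using fuzzy matching on the two columns 'Name' and 'Degree', so that likely duplicates are collapsed into one row. This is what I have so far, with help from this reference: Apply fuzzy matching across a dataframe column and save results in a new column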
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
compare = pd.MultiIndex.from_product([df1['Name'],
                                      df2['Name']]).to_series()
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
compare.apply(metrics)
compare.apply(metrics).unstack().idxmax().unstack(0)
compare.apply(metrics).unstack(0).idxmax().unstack(0)
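Let's say that if fuzz.ratio is above 80 for both the name and the degree, we treat the two rows as the same person, and we keep the Name and Degree from df1 as the default values. How can I get the following expected result? Thanks.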
df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')
     Name     Degree   Age  GPA  duplicatedName  duplicatedDegree
0    John    Masters  27.0  3.0         John S.            Master
1     Bob   Graduate  23.0  3.5          Bob K.         Graduated
2  Shiela   Graduate  21.0  NaN             NaN         Graduated
3   Frank  Graduated   NaN  4.0             NaN          Graduate
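The 80 cut-off can be sanity-checked on the sample data before any merging. The following is a minimal sketch, assuming difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio; the helper names ratio and pairs_above are hypothetical, and fuzzywuzzy's own scores may differ slightly depending on the backend it uses.

```python
from difflib import SequenceMatcher
from itertools import product

def ratio(a, b):
    # Hypothetical stand-in for fuzz.ratio: similarity on a 0-100 scale
    return round(100 * SequenceMatcher(None, a, b).ratio())

names1, names2 = ['John', 'Bob', 'Shiela'], ['John S.', 'Bob K.', 'Frank']
degrees1, degrees2 = ['Masters', 'Graduate'], ['Master', 'Graduated']

def pairs_above(left, right, threshold):
    # Every (left, right, score) triple whose score exceeds the threshold
    return [(a, b, ratio(a, b)) for a, b in product(left, right)
            if ratio(a, b) > threshold]

print(pairs_above(names1, names2, 80))      # no name pair clears 80
print(pairs_above(names1, names2, 60))      # John/John S. (73) and Bob/Bob K. (67)
print(pairs_above(degrees1, degrees2, 80))  # Masters/Master (92), Graduate/Graduated (94)
```

Under this stand-in, the degree pairs pass a threshold of 80 but the name pairs do not, so the cut-off for names would have to be lower for any rows to match.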
Answer 0 (score: 2)
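For me it works with N = 60; I think the ratio threshold should be lower than 80, since the name pairs in this sample score below 80. Build a Series from a dict comprehension over every pair of values, filter it by N, and keep the highest-scoring match in each group. Finally, use map with fillna and then merge: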
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from itertools import product
N = 60
names = {tup: fuzz.ratio(*tup) for tup in
product(df1['Name'].tolist(), df2['Name'].tolist())}
s1 = pd.Series(names)
s1 = s1[s1 > N]
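s1 = s1.loc[s1.groupby(level=0).idxmax().tolist()]
# assumed missing step: turn the (df1 name, df2 name) index into a df2 -> df1 lookup for map() below
s1 = pd.Series(s1.index.get_level_values(0), index=s1.index.get_level_values(1))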
print (s1)
John S.    John
Bob K.      Bob
dtype: object
degrees = {tup: fuzz.ratio(*tup) for tup in
product(df1['Degree'].tolist(), df2['Degree'].tolist())}
s2 = pd.Series(degrees)
s2 = s2[s2 > N]
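s2 = s2.loc[s2.groupby(level=0).idxmax().tolist()]
# assumed missing step: turn the (df1 degree, df2 degree) index into a df2 -> df1 lookup for map() below
s2 = pd.Series(s2.index.get_level_values(0), index=s2.index.get_level_values(1))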
print (s2)
Graduated    Graduate
Master        Masters
dtype: object
df2['Name'] = df2['Name'].map(s1).fillna(df2['Name'])
df2['Degree'] = df2['Degree'].map(s2).fillna(df2['Degree'])
# generally slower alternative
#df2['Name'] = df2['Name'].replace(s1)
#df2['Degree'] = df2['Degree'].replace(s2)
df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')
print (df)
     Name    Degree   Age  GPA
0    John   Masters  27.0  3.0
1     Bob  Graduate  23.0  3.5
2  Shiela  Graduate  21.0  NaN
3   Frank  Graduate   NaN  4.0
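The merged frame above does not carry the duplicatedName and duplicatedDegree columns from the question. One possible way to keep them is to save the original df2 spellings before they are overwritten. This is only a sketch under stated assumptions: the lookups name_map and degree_map are hypothetical plain dicts copied from the s1 and s2 output shown above, and a row that exists only in df1 gets NaN in both extra columns, which does not reproduce the question's table exactly.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Bob', 'Shiela'],
                    'Degree': ['Masters', 'Graduate', 'Graduate'],
                    'Age': [27, 23, 21]})
df2 = pd.DataFrame({'Name': ['John S.', 'Bob K.', 'Frank'],
                    'Degree': ['Master', 'Graduated', 'Graduated'],
                    'GPA': [3, 3.5, 4]})

# df2 value -> df1 value lookups (assumed, copied from the s1 and s2 output)
name_map = {'John S.': 'John', 'Bob K.': 'Bob'}
degree_map = {'Graduated': 'Graduate', 'Master': 'Masters'}

# Keep the original df2 spelling only where a fuzzy match will replace it;
# unmatched values become NaN
df2['duplicatedName'] = df2['Name'].where(df2['Name'].isin(list(name_map)))
df2['duplicatedDegree'] = df2['Degree'].where(df2['Degree'].isin(list(degree_map)))

# Normalise df2 to the df1 spellings, then merge as before
df2['Name'] = df2['Name'].map(name_map).fillna(df2['Name'])
df2['Degree'] = df2['Degree'].map(degree_map).fillna(df2['Degree'])

df = df1.merge(df2, on=['Name', 'Degree'], how='outer')
print(df)
```

The row order of an outer merge can vary between pandas versions, so it is safer to look rows up by Name than to rely on their position.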