I have two different customer dataframes, and I want to match their rows using a Jaccard distance matrix or any other method.
df1
Name country cost
0 raj Kazakhstan 23
1 sam Russia 243
2 kanan Belarus 2
3 Nan Nan 0
df2
Name country DOB
0 rak Kazakhstan 12-12-1903
1 sim russia 03-04-1994
2 raj Belarus 21-09-2003
3 kane Belarus 23-12-1999
Output:
If the string comparison score is greater than 0.6, I want to merge the two rows into a new dataframe.
Df3
Name country Name country cost DOB
0 raj Kazakhstan rak Kazakhstan 23 12-12-1903
1 sam Russia sim russia 243 03-04-1994
2 kanan Belarus kane Belarus 2 23-12-1999
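I have tried computing the comparison row by row, but is there a way to compare each whole row of one dataframe against each whole row of the other?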
Answer 0 (score: 4)
I would use fuzzywuzzy:
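from fuzzywuzzy import process

# Build a string key from the text columns only; summing whole rows
# fails once a numeric column such as cost is present.
df1['key'] = df1['Name'].astype(str) + df1['country'].astype(str)
df2['key'] = df2['Name'].astype(str) + df2['country'].astype(str)

def yoursource(x):
    # extractOne returns the best (match, score) pair; scores run from 0 to 100,
    # so the 0.6 threshold from the question becomes 60 here.
    match, score = process.extractOne(x, df2['key'].tolist())
    return match if score > 60 else 'notmatch'

df1['key'] = df1['key'].apply(yoursource)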
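If fuzzywuzzy is not available, the same "best match above a cutoff" step can be sketched with the standard library. This is only an illustration under assumptions: `best_match` and `df2_keys` are hypothetical names, `difflib` scores on a 0 to 1 scale with a different algorithm than fuzzywuzzy, and the keys are hard-coded from the sample data instead of being read from the dataframes.

```python
import difflib

def best_match(key, candidates, cutoff=0.6):
    """Return the candidate most similar to key, or 'notmatch' below the cutoff."""
    # Compare case-insensitively, but return the candidate in its original spelling.
    lowered = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(key.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else "notmatch"

# Keys built as Name + country from the sample df2 (hard-coded for illustration).
df2_keys = ["rakKazakhstan", "simrussia", "rajBelarus", "kaneBelarus"]

for key in ["rajKazakhstan", "samRussia", "kananBelarus", "NanNan"]:
    print(key, "->", best_match(key, df2_keys))
```

The returned string can then replace the key in df1 exactly as `yoursource` does above.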
After that, we use merge. The inner join also drops the df1 rows marked 'notmatch':
df = df1.merge(df2, on='key', how='inner').drop(columns='key')
df
Name_x country_x Name_y country_y
0 raj Kazakhstan rak Kazakhstan
1 sam Russia sim russia
2 kanan Belarus kane Belarus
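Since the question mentions Jaccard distance, here is a rough sketch of that route as well. It is not what fuzzywuzzy computes: it assumes the strings are compared as sets of character bigrams of the lowercased Name + country key, and `bigrams`, `jaccard`, `best_matches`, `rows1`, and `rows2` are hypothetical names holding the sample data as plain tuples.

```python
def bigrams(text):
    """Return the set of character bigrams of a lowercased string."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Jaccard similarity of the bigram sets: size of intersection over size of union."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)

# Sample rows from the question as (Name, country, cost) and (Name, country, DOB).
rows1 = [("raj", "Kazakhstan", 23), ("sam", "Russia", 243),
         ("kanan", "Belarus", 2), ("Nan", "Nan", 0)]
rows2 = [("rak", "Kazakhstan", "12-12-1903"), ("sim", "russia", "03-04-1994"),
         ("raj", "Belarus", "21-09-2003"), ("kane", "Belarus", "23-12-1999")]

def best_matches(left, right, threshold=0.6):
    """For each left row, keep the most similar right row if it scores above threshold."""
    merged = []
    for name1, country1, cost in left:
        key1 = name1 + country1
        scored = [(jaccard(key1, n2 + c2), (n2, c2, dob)) for n2, c2, dob in right]
        score, (name2, country2, dob) = max(scored, key=lambda pair: pair[0])
        if score > threshold:
            merged.append((name1, country1, name2, country2, cost, dob))
    return merged

result = best_matches(rows1, rows2)
for row in result:
    print(row)
```

Every row of the first table is scored against every row of the second, which is the whole-row comparison the question asks about. On larger frames this pairwise loop grows quadratically, so blocking on a cheap field such as country first may be worth considering.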