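In this question I tried matching names with difflib and FuzzyWuzzy, but the match rate was poor because the names vary too much. I am now trying to use the other data fields I have alongside the names, but I am completely unsure how to approach this kind of problem. If anything is unclear, please let me know and I will do my best to clarify.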
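I have two dataframes that hold similar, but not exactly matching, information about people. I want to map each reference number in one dataframe to the corresponding reference number in the other; a reference number is unique to each person. For example, in the tables below I want to find out that Jimmy / James Random (the same person, although the names do not match) has reference number 1234 in DF1 and 89 in DF2. Note that a person's rank may change, but it changes in both tables at the same time. Each person's reference number, style, ID and nationality always stay the same.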
import pandas as pd

columns = ["Ref", "Date", "Name", "Rank", "Nationality", "Style", "ID"]

# First source: Ref 1234 is Jimmy Random
df1 = pd.DataFrame(columns=columns, data=[
    ["1234", "20200104", "Jimmy Random", "General", "France", "Aggressive", ""],
    ["1333", "20200104", "Ian Fleming", "Brigadier", "England", "Passive", "14"],
    ["1234", "20191204", "Jimmy Random", "Major", "France", "", "15"],
    ["1000", "20200404", "Peter Nisbett", "Corporal", "", "Passive", "12"],
])

# Second source: Ref 89 is James Random (the same person as Ref 1234 above)
df2 = pd.DataFrame(columns=columns, data=[
    ["89", "20200104", "James Random", "", "France", "Aggressive", "104"],
    ["10", "20200104", "I. Fleming", "Brigadier", "England", "", "4"],
    ["156", "20200404", "P. Nisbett", "", "Spain", "Passive", "5"],
    ["89", "20191204", "James Random", "Major", "France", "Aggressive", "104"],
])
Many thanks for any help you can offer.
Cheeseburger
Answer 0 (score: 1)
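You basically need to compare the strings and combine that with other analysis, right? Take a look at cosine similarity, which is implemented in scikit-learn.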
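To make the suggestion concrete, here is a minimal sketch of the idea, not a definitive implementation. It computes cosine similarity over character bigram counts using only the standard library, then nudges the score up or down when Nationality and Style agree or disagree. The helper names (`ngrams`, `cosine`, `score`) and the `field_weight` of 0.25 are assumptions made for illustration; blank fields are simply skipped. With scikit-learn, the name part would typically be done with `TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))` followed by `sklearn.metrics.pairwise.cosine_similarity`.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n=2):
    """Count character n-grams of a lower-cased string padded with spaces."""
    s = f" {text.lower()} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# One row per person from the question's sample data:
# (Ref, Name, Nationality, Style)
people1 = [
    ("1234", "Jimmy Random", "France", "Aggressive"),
    ("1333", "Ian Fleming", "England", "Passive"),
    ("1000", "Peter Nisbett", "", "Passive"),
]
people2 = [
    ("89", "James Random", "France", "Aggressive"),
    ("10", "I. Fleming", "England", ""),
    ("156", "P. Nisbett", "Spain", "Passive"),
]


def score(p, q, field_weight=0.25):
    """Name similarity plus a bonus/penalty per non-blank stable field.

    field_weight is an arbitrary illustrative value, not a tuned one.
    """
    s = cosine(ngrams(p[1]), ngrams(q[1]))
    for x, y in zip(p[2:], q[2:]):
        if x and y:  # skip fields that are blank on either side
            s += field_weight if x == y else -field_weight
    return s


# For each person in the first list, pick the best-scoring candidate in the second
matches = {p[0]: max(people2, key=lambda q: score(p, q))[0] for p in people1}
print(matches)
```

On this sample the sketch pairs Ref 1234 with 89, 1333 with 10, and 1000 with 156. On real data you would probably also want a minimum score threshold so that people who appear in only one table are left unmatched rather than forced onto the nearest candidate.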