df1 =
Id1 | city     | state     | country
d1  | Hyd      | Telangana | India
d2  | Banglore | Karnataka | India
d3  | Mysore   | karnataka | India
df2 =
Id2 | city     | state     | country
b1  | Hyd      | Telangana | India
b2  | Banglore | Karnataka | India
Output:
Id1 | Id2 | similarity_score
d1  | b1  | 100
d1  | b2  | 33.33
d2  | b1  | 33.33
d2  | b2  | 100
d3  | b1  | 33.33
d3  | b2  | 66.66
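The similarity score here is only approximate: it is the percentage of compared columns whose values match (for example, d3 and b2 agree on state and country, so 2 of 3 columns). There may be more than three columns to compare.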
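To make the scoring rule concrete, here is a minimal sketch for a single pair of records. The helper name `row_similarity` is hypothetical, and it assumes that values should be compared ignoring case and surrounding whitespace (otherwise "karnataka" and "Karnataka" would not count as a match, and d3 vs. b2 would score 33.33 instead).

```python
def row_similarity(a, b, columns):
    """Percentage of the given columns whose values match between two records."""
    def norm(value):
        # Assumption: ignore case and leading/trailing whitespace when comparing.
        return str(value).strip().lower()

    # Each column contributes 1 if the values match, 0 otherwise.
    matches = sum(norm(a[col]) == norm(b[col]) for col in columns)
    return 100.0 * matches / len(columns)


d3 = {"city": "Mysore", "state": "karnataka", "country": "India"}
b2 = {"city": "Banglore", "state": "Karnataka", "country": "India"}

score = row_similarity(d3, b2, ["city", "state", "country"])
print(round(score, 2))  # 2 of 3 columns match; rounds to 66.67 (listed as 66.66 above)
```

Because the column list is a parameter, the same function works unchanged when more than three columns are compared.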
I tried using a Levenshtein function:
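memo = {}  # cache: (s, t) -> edit distance already computed

def levenshtein(s, t):
    # Recursive edit distance between s and t, memoized on prefixes.
    if s == "":
        return len(t)
    if t == "":
        return len(s)
    # Substitution cost for the last characters.
    cost = 0 if s[-1] == t[-1] else 1
    i1 = (s[:-1], t)        # delete the last character of s
    if i1 not in memo:
        memo[i1] = levenshtein(*i1)
    i2 = (s, t[:-1])        # insert the last character of t
    if i2 not in memo:
        memo[i2] = levenshtein(*i2)
    i3 = (s[:-1], t[:-1])   # substitute (or keep) the last character
    if i3 not in memo:
        memo[i3] = levenshtein(*i3)
    res = min(memo[i1] + 1, memo[i2] + 1, memo[i3] + cost)
    return res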
import pandas as pd

scores = []
for index, row in stringData.iterrows():
    records = []  # collected as dicts because DataFrame.append was removed in pandas 2.0
    for innerIndex, innerRow in stringData.iterrows():
        if row['id1'] != innerRow['id2']:
            records.append({
                'id1': row['id1'],
                'id2': innerRow['id2'],
                # Sum of per-column edit distances (character level).
                'SimilarityScore': levenshtein(str(row['city']), str(innerRow['city']))
                                   + levenshtein(str(row['state']), str(innerRow['state']))
                                   + levenshtein(str(row['country']), str(innerRow['country'])),
            })
    df = pd.DataFrame(records, columns=['id1', 'id2', 'SimilarityScore'])
    fileName = 'score' + str(row['id1']) + '.csv'
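This function compares the columns letter by letter and gives me a character-level similarity, but what I want is to compare whole column values, get 1 (match) or 0 (no match) per column, and compute the similarity score from those.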
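One possible way to get the 1/0 per-column comparison is sketched below. It is not a definitive solution: it assumes pandas 1.2 or later (for `how="cross"`), that every row of df1 should be paired with every row of df2, and that matching ignores case and surrounding whitespace. The names `compare_cols`, `normalize`, `pairs`, `flags` and `result` are introduced only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Id1": ["d1", "d2", "d3"],
    "city": ["Hyd", "Banglore", "Mysore"],
    "state": ["Telangana", "Karnataka", "karnataka"],
    "country": ["India", "India", "India"],
})
df2 = pd.DataFrame({
    "Id2": ["b1", "b2"],
    "city": ["Hyd", "Banglore"],
    "state": ["Telangana", "Karnataka"],
    "country": ["India", "India"],
})

# Columns to compare; extend this list if there are more than three.
compare_cols = ["city", "state", "country"]


def normalize(frame, cols):
    """Return a copy with the compared columns stripped and lower-cased (assumption)."""
    out = frame.copy()
    for col in cols:
        out[col] = out[col].astype(str).str.strip().str.lower()
    return out


# Cartesian product of the two frames: one row per (Id1, Id2) pair.
pairs = normalize(df1, compare_cols).merge(
    normalize(df2, compare_cols), how="cross", suffixes=("_1", "_2")
)

# 1 where the two values of a column agree, 0 otherwise.
flags = pd.DataFrame({
    col: (pairs[f"{col}_1"] == pairs[f"{col}_2"]).astype(int)
    for col in compare_cols
})

# Share of matching columns, expressed as a percentage.
pairs["similarity_score"] = (flags.mean(axis=1) * 100).round(2)
result = pairs[["Id1", "Id2", "similarity_score"]]
print(result)
```

The cross join grows as len(df1) * len(df2), so for large frames it may be worth processing df1 in chunks. Note that rounding gives 66.67 for d3/b2, where the expected output above lists 66.66.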