Calculating a similarity score between two DataFrames in pandas

Time: 2019-06-24 19:52:12

Tags: python-3.x pandas levenshtein-distance

I have two DataFrames:

df1 = 

Id1   |city        |state       |country
d1    |Hyd         |Telangana   |India
d2    |Banglore    |Karnataka   |India
d3    |Mysore      |karnataka   |India


df2 = 

Id2   city        state       country
b1    Hyd         Telangana   India
b2    Banglore    Karnataka   India

Expected output:

Id1    Id2   similarity_score    
d1     b1          100   
d1     b2          33.33    
d2     b1          33.33   
d2     b2          100   
d3     b1          33.33    
d3     b2          66.66   

The similarity scores shown here are only approximate, and there may be more than three columns to compare.
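The expected scores look consistent with counting how many of the compared columns are equal and dividing by the number of columns. A minimal sketch of that arithmetic for a single pair of rows follows; the helper name `row_similarity` is hypothetical, and ignoring case and surrounding whitespace is an assumption made so that "karnataka" and "Karnataka" count as a match, as the d3/b2 score suggests.

```python
def row_similarity(a, b, columns):
    """Percentage of `columns` whose values are equal in rows `a` and `b`."""
    # Assumption: case and surrounding whitespace are ignored, so that
    # "karnataka" and "Karnataka" are treated as the same state.
    hits = sum(
        str(a[c]).strip().lower() == str(b[c]).strip().lower() for c in columns
    )
    return 100 * hits / len(columns)


# Rows d3 (from df1) and b2 (from df2), written as plain dicts.
d3 = {"city": "Mysore", "state": "karnataka", "country": "India"}
b2 = {"city": "Banglore", "state": "Karnataka", "country": "India"}

# state and country match, city does not: 2 of 3 columns.
print(row_similarity(d3, b2, ["city", "state", "country"]))
```

Because the column list is a parameter, the same calculation applies unchanged when more than three columns are compared.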

I tried using a Levenshtein function:

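memo = {}  # cache of distances already computed for (s, t) pairs

def levenshtein(s, t):
    # Edit distance between s and t, computed recursively on string prefixes
    if s == "":
        return len(t)
    if t == "":
        return len(s)
    # Substitution cost: 0 if the last characters agree, otherwise 1
    cost = 0 if s[-1] == t[-1] else 1
    i1 = (s[:-1], t)
    if i1 not in memo:
        memo[i1] = levenshtein(*i1)
    i2 = (s, t[:-1])
    if i2 not in memo:
        memo[i2] = levenshtein(*i2)
    i3 = (s[:-1], t[:-1])
    if i3 not in memo:
        memo[i3] = levenshtein(*i3)
    res = min(memo[i1] + 1, memo[i2] + 1, memo[i3] + cost)
    return res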

import pandas as pd

for index, row in stringData.iterrows():
    rows = []  # one result row per compared pair
    for innerIndex, innerRow in stringData.iterrows():
        if row['id1'] != innerRow['id2']:
            rows.append({'id1': row['id1'], 'id2': innerRow['id2'],
                         'SimilarityScore': levenshtein(row['city'], innerRow['city'])
                                            + levenshtein(str(row['state']), str(innerRow['state']))
                                            + levenshtein(str(row['country']), str(innerRow['country']))})
    # DataFrame.append was removed in pandas 2.0, so the frame is built from a list of rows
    df = pd.DataFrame(rows, columns=['id1', 'id2', 'SimilarityScore'])
    fileName = 'score' + str(row['id1']) + '.csv'

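This function compares the columns letter by letter and gives me a similarity, but I want to compare each column as a whole, get 1 or 0 per column, and compute the similarity from that.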
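One possible way to get a 1-or-0 comparison per column is to pair every row of df1 with every row of df2 using a cross join and then take the share of matching columns. The following is only a sketch under some assumptions: `pairwise_similarity` is a hypothetical helper name, the id columns are called `Id1` and `Id2` as in the tables above, values are compared case-insensitively after stripping whitespace, and `merge(how="cross")` requires pandas 1.2 or later.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Id1": ["d1", "d2", "d3"],
    "city": ["Hyd", "Banglore", "Mysore"],
    "state": ["Telangana", "Karnataka", "karnataka"],
    "country": ["India", "India", "India"],
})
df2 = pd.DataFrame({
    "Id2": ["b1", "b2"],
    "city": ["Hyd", "Banglore"],
    "state": ["Telangana", "Karnataka"],
    "country": ["India", "India"],
})


def pairwise_similarity(left, right, compare_cols):
    """Score every (left row, right row) pair by the share of equal columns."""
    # Cross join: each row of `left` is paired with each row of `right`.
    # Shared column names receive the suffixes _1 and _2.
    pairs = left.merge(right, how="cross", suffixes=("_1", "_2"))

    # One boolean column per compared field: True (1) on match, False (0) otherwise.
    # Assumption: case and surrounding whitespace are ignored.
    matches = pd.DataFrame({
        col: pairs[f"{col}_1"].astype(str).str.strip().str.lower()
        == pairs[f"{col}_2"].astype(str).str.strip().str.lower()
        for col in compare_cols
    })

    result = pairs[["Id1", "Id2"]].copy()
    # Mean of the 1/0 flags across columns, expressed as a percentage.
    result["similarity_score"] = (matches.mean(axis=1) * 100).round(2)
    return result


scores = pairwise_similarity(df1, df2, ["city", "state", "country"])
print(scores)
```

With the sample data this yields one row per (Id1, Id2) pair; the d3/b2 pair is rounded to 66.67 rather than truncated to 66.66. The cross join grows with the product of the two row counts, so for large frames some form of blocking (for example, only pairing rows that share a country) may be needed.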

0 answers:

No answers yet