Hello, I found some code here that converts two strings into vectors and then compares them to return the cosine of their similarity:
import math
import re
from collections import Counter

import pandas as pd

WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    # Dot product over the words that appear in both vectors
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    # Product of the Euclidean norms of the two vectors
    sum1 = sum(vec1[x]**2 for x in vec1.keys())
    sum2 = sum(vec2[x]**2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        # At least one vector is empty, so treat the similarity as zero
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    # Map each word to the number of times it occurs in the text
    words = WORD.findall(text)
    return Counter(words)

text1 = 'string one'
text2 = 'string two'
vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
print(get_cosine(vector1, vector2))
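I am trying to apply this method to a pandas DataFrame so that, instead of comparing two arbitrary strings, it iterates over every row, converts the string values in that row to vectors, and returns the results in a new column. The example above returns 0.5: the two strings share the word 'string' but not 'one' and 'two', so the dot product is 1 and each vector has norm sqrt(2), giving 1/2 = 0.5. I have two columns, df['Address 1'] and df['Address 2'], each holding address strings. I want to compare them, compute the cosine of their similarity, and return that value in a new column, df['Address Cosine'].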
For example, if df['Address 1'] holds '685 EASY STREET', '122 FOURTH AVE', '9189 FIFTY NINTH ST' and df['Address 2'] holds '685 EASY STREET', '240 FOURTH AVE', '9189 THIRTY EIGHTH ST', then I would like df['Address Cosine'] to contain '1.0', '0.66', '0.5'.
Any ideas?
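For reference, here is a minimal sketch of one way this could be done. It assumes both columns contain only strings (no missing values) and that the rows should be compared pairwise. The sample DataFrame is built from the addresses in the example, and the helper functions are compact rewrites of the ones in the question. This is an illustration, not necessarily the fastest approach for large frames.

```python
import math
import re
from collections import Counter

import pandas as pd

WORD = re.compile(r"\w+")


def text_to_vector(text):
    # Bag-of-words vector: word -> number of occurrences
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    # Dot product over the shared words, divided by the product of the norms
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0


# Sample data taken from the example in the question
df = pd.DataFrame({
    "Address 1": ["685 EASY STREET", "122 FOURTH AVE", "9189 FIFTY NINTH ST"],
    "Address 2": ["685 EASY STREET", "240 FOURTH AVE", "9189 THIRTY EIGHTH ST"],
})

# Walk the two columns in parallel and compute one cosine value per row.
# Assumption: every cell is a string; NaN values would need handling first.
df["Address Cosine"] = [
    get_cosine(text_to_vector(a), text_to_vector(b))
    for a, b in zip(df["Address 1"], df["Address 2"])
]

print(df)
```

Iterating with zip over the two columns avoids the per-row overhead of df.apply(..., axis=1), although either should give the same values. The results are floats, so the middle row comes out as roughly 0.667 rather than exactly '0.66'; round or format the column if a fixed number of decimals is needed.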