Compare string values of two columns in pandas and find the vector cosine

Asked: 2019-11-18 20:14:43

Tags: python string pandas vector comparison

Hello, I found some code here that converts two strings to vectors and then compares them to return the cosine of their similarity:

import math
import re
from collections import Counter

import pandas as pd

# Matches runs of word characters (letters, digits, underscore)
WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    # Dot product over the words that appear in both vectors
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Product of the two vector magnitudes
    sum1 = sum(vec1[x]**2 for x in vec1.keys())
    sum2 = sum(vec2[x]**2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty vector has zero magnitude; avoid dividing by zero
    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    # Word-count vector: maps each word to how often it occurs
    words = WORD.findall(text)
    return Counter(words)

text1 = 'string one'
text2 = 'string two'

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)

print(get_cosine(vector1, vector2))

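I am trying to apply this method to a pandas DataFrame so that, instead of two arbitrary strings, it goes through each row, converts that row's string values to vectors, and returns the results in a new column. The example above returns 0.5: 'string' appears in both texts but 'one' and 'two' do not, so the dot product is 1 and each vector has magnitude sqrt(2), giving 1 / (sqrt(2) * sqrt(2)) = 0.5. I have two columns, df['Address 1'] and df['Address 2'], each holding address strings. I would like to compare them, get the cosine of their similarity, and return that value as a new column, df['Address Cosine'].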

For example, if df['Address 1'] holds '685 EASY STREET', '122 FOURTH AVE', '9189 FIFTY NINTH ST' and df['Address 2'] holds '685 EASY STREET', '240 FOURTH AVE', '9189 THIRTY EIGHTH ST', then I would like df['Address Cosine'] to contain '1.0', '0.66', '0.5'.
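A minimal sketch of what I have in mind, assuming that `DataFrame.apply` with `axis=1` is a reasonable way to go row by row and that both columns hold plain strings with no missing values. The helper name `row_cosine` is made up for illustration, and the two functions are restated so the snippet runs on its own:

```python
import math
import re
from collections import Counter

import pandas as pd

WORD = re.compile(r"\w+")


def text_to_vector(text):
    # Word-count vector for one string
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    # Dot product over shared words, divided by the product of magnitudes
    numerator = sum(vec1[w] * vec2[w] for w in set(vec1) & set(vec2))
    sum1 = sum(count ** 2 for count in vec1.values())
    sum2 = sum(count ** 2 for count in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    return numerator / denominator if denominator else 0.0


# Sample data taken from the example above
df = pd.DataFrame({
    "Address 1": ["685 EASY STREET", "122 FOURTH AVE", "9189 FIFTY NINTH ST"],
    "Address 2": ["685 EASY STREET", "240 FOURTH AVE", "9189 THIRTY EIGHTH ST"],
})


def row_cosine(row):
    # Hypothetical helper: assumes both cells are strings (a NaN would fail here)
    return get_cosine(text_to_vector(row["Address 1"]),
                      text_to_vector(row["Address 2"]))


# axis=1 passes each row to row_cosine as a Series
df["Address Cosine"] = df.apply(row_cosine, axis=1)
print(df)
```

On the sample rows this should come out at roughly 1.0, 0.667 and 0.5, up to floating-point rounding. Since `apply` with `axis=1` calls the function once per row in Python, it may be slow on a large frame.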

Any ideas?

0 Answers:

No answers yet