Cosine similarity in Python 3

Date: 2018-09-17 12:08:54

Tags: python-3.x

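I have two lists.

I am using word vectors and cosine similarity to find similar words, based on the cosine of the angle between two vectors.

I have already defined the functions that build the word vectors and compute the cosine similarity, so I am not including them here.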
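The helper functions are not shown in the question. As a minimal sketch of what they might look like, assuming a simple bag-of-words representation built with `collections.Counter`: the bodies of `text_to_vector` and `get_cosine` below are assumptions for illustration, not the asker's actual code.

```python
import math
import re
from collections import Counter

# Assumption: a "vector" is a Counter mapping each word token to its count.
WORD = re.compile(r"\w+")


def text_to_vector(text):
    """Turn a string into a sparse word-count vector."""
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    """Cosine similarity between two sparse count vectors."""
    # The dot product only needs the keys both vectors share.
    common = set(vec1) & set(vec2)
    numerator = sum(vec1[k] * vec2[k] for k in common)

    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    denominator = norm1 * norm2

    # An empty vector has no direction, so treat its similarity as 0.
    if denominator == 0:
        return 0.0
    return numerator / denominator
```

With single-word inputs such as the ones in this question, a representation like this can only yield 1.0 for identical words and 0.0 otherwise. Graded similarity between different words would need real word embeddings.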

tar1 = ['apple','fruit', 'vegetable','school']
tar2 = ['fruit', 'apple', 'school','vegetable']

# compare every word in tar1 against every word in tar2
for i in range(len(tar1)):
    vect1 = text_to_vector(tar1[i].strip().lower())

    for j in range(len(tar2)):
        vect2 = text_to_vector(tar2[j].strip().lower())
        cosine = get_cosine(vect1, vect2)

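In the nested loop, once the inner loop has finished, I want to pick the string with the maximum cosine similarity.

For example, the first item in tar1 is 'apple', and 'apple' in tar2 has the highest cosine similarity with it, so that is the word that should be selected.

I am looking for output like the following.

o/p = ['apple', 'fruit', 'vegetable', 'school']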

1 Answer:

Answer 0: (score: 0)

A possible implementation to get what you want (with comments):

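# placeholder: stands in for the real word-vector function
def text_to_vector(text):
    return text


# placeholder: returns 1 for identical inputs, 0 otherwise
def get_cosine(x, y):
    return 1 if x == y else 0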


tar1 = ['apple', 'fruit', 'vegetable', 'school']
tar2 = ['fruit', 'apple', 'school', 'vegetable']

result = []
# iterate over words in tar1
for word_1 in tar1:
    vector_1 = text_to_vector(word_1)
    # keep track of the maximum cosine and most similar word
    max_cosine, best_word = -1, None
    # iterate over words in tar2 for every word in tar1
    for word_2 in tar2:
        # compute cosine
        cosine = get_cosine(vector_1, text_to_vector(word_2))
        # check if current word from tar2 is the most similar to the word from tar1
        if cosine > max_cosine:
            max_cosine, best_word = cosine, word_2
    # remember result for every word from tar1
    result.append(best_word)

print(result)

Output:

['apple', 'fruit', 'vegetable', 'school']
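The same selection can also be written more compactly with the built-in `max` and a `key` function. The sketch below is one possible variant, not part of the original answer: the helper name `most_similar` and the stand-in `exact_match` similarity are hypothetical, and any function that scores two words can be passed in instead.

```python
def most_similar(word, candidates, similarity):
    """Return the candidate with the highest similarity to word.

    max() returns the first maximal element, so ties resolve to the
    earliest candidate, matching a loop that uses a strict '>' check.
    """
    return max(candidates, key=lambda candidate: similarity(word, candidate))


# Hypothetical stand-in similarity: 1 for identical words, 0 otherwise.
def exact_match(x, y):
    return 1 if x == y else 0


tar1 = ['apple', 'fruit', 'vegetable', 'school']
tar2 = ['fruit', 'apple', 'school', 'vegetable']

# pick the best match in tar2 for every word in tar1
result = [most_similar(word, tar2, exact_match) for word in tar1]
print(result)  # ['apple', 'fruit', 'vegetable', 'school']
```

Passing the similarity function as a parameter keeps the selection logic independent of how the vectors and the cosine are actually computed.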