Python - Efficiently finding the n nearest vectors

Asked: 2018-04-06 17:07:58

Tags: python vector similarity cosine-similarity word-embedding

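I'm trying to write a Python method that efficiently returns the n words closest to a given word, based on their embedding vectors. Each vector has 200 dimensions, and there are several million of them.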

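This is what I have right now. It simply computes the cosine similarity between the target word and every other word, then sorts the whole vocabulary by that score. It is very, very slow: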

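# Assumed import: the reshape((1, -1)) calls below suggest scikit-learn's cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

def n_nearest_words(word, n, word_vectors):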
    """
    Return a list of the n nearest words to param word, based on cosine similarity
    param word_vectors: dict, keys are words and values are vectors
    """
    # get_word_vector() finds the word in the word_vectors dict, using a number of
    # possible capitalizations. Returns None if not found
    word_vector = get_word_vector(word, word_vectors)
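    # Compare with None explicitly: the truth value of a multi-element NumPy array is ambiguous
    if word_vector is not None: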
        word_vector = word_vector.reshape((1, -1))
        sorted_by_sim = sorted(
            word_vectors.keys(),
            key=lambda other_word: cosine_similarity(word_vector, word_vectors[other_word].reshape((1, -1))),
            reverse=True)
        return sorted_by_sim[1:n + 1] # ignore first item, which should be target word itself
    return list()

Does anyone have a better suggestion?

1 Answer:

Answer 0: (score: 1)

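Maybe try storing the distance between two words in a dict keyed on the pair of words, so that once you have computed it for a pair you can just look it up the next time.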
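
A minimal sketch of that idea, under some assumptions: the class name CachedSimilarity and the helper cosine() are hypothetical and do not come from the question, vectors are treated as plain sequences of non-zero floats (1-D NumPy arrays would also iterate this way), and the word is looked up directly rather than through get_word_vector(). heapq.nlargest is swapped in for the full sort so that only the top n are kept.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


class CachedSimilarity:
    """Hypothetical helper: memoizes pairwise similarities in a dict."""

    def __init__(self, word_vectors):
        self.word_vectors = word_vectors  # dict: word -> vector
        self.cache = {}                   # frozenset({a, b}) -> similarity

    def similarity(self, a, b):
        # An unordered key lets (a, b) and (b, a) share one cache entry.
        key = frozenset((a, b))
        if key not in self.cache:
            self.cache[key] = cosine(self.word_vectors[a], self.word_vectors[b])
        return self.cache[key]

    def n_nearest_words(self, word, n):
        if word not in self.word_vectors:
            return []
        # Skip the target word itself, then keep only the n best matches.
        others = (w for w in self.word_vectors if w != word)
        return heapq.nlargest(n, others, key=lambda w: self.similarity(word, w))
```

Note that the cache only pays off when the same pair is needed again: the first query for a word still compares it against every other vector, and with several million words the dict of pairs can grow very large.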