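I'm trying to write a Python method that efficiently returns the n words closest to a given word, based on their embedding vectors. Each vector has 200 dimensions, and there are a few million of them.
This is what I have right now. It simply computes the cosine similarity between the target word and every other word, and it is very, very slow: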
# Assumption: cosine_similarity is scikit-learn's, which expects 2-D inputs
# (hence the reshape calls below).
from sklearn.metrics.pairwise import cosine_similarity


def n_nearest_words(word, n, word_vectors):
    """
    Return a list of the n nearest words to param word, based on cosine similarity
    param word_vectors: dict, keys are words and values are vectors
    """
    # get_word_vector() finds the word in the word_vectors dict, using a number of
    # possible capitalizations. Returns None if not found
    word_vector = get_word_vector(word, word_vectors)
    # Compare against None explicitly: the truth value of a NumPy array is ambiguous
    if word_vector is not None:
        word_vector = word_vector.reshape((1, -1))
        sorted_by_sim = sorted(
            word_vectors.keys(),
            key=lambda other_word: cosine_similarity(word_vector, word_vectors[other_word].reshape((1, -1)))[0, 0],
            reverse=True)
        return sorted_by_sim[1:n + 1]  # ignore first item, which should be target word itself
    return []
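Does anyone have a better suggestion?
Answer 0: (score: 1)
Maybe try storing the distance between two words in a dict keyed by the word pair, so that once you have computed it for a pair you can just look it up the next time.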
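A minimal sketch of that caching idea, under some assumptions: the `SimilarityCache` class and its method names are hypothetical, cosine similarity is computed directly with NumPy rather than with the question's `cosine_similarity` helper, and `heapq.nlargest` replaces the full sort since only the top n are needed. The key is order-independent, so `(a, b)` and `(b, a)` share one entry.

```python
import heapq

import numpy as np


def cosine(a, b):
    """Cosine similarity of two 1-D vectors (assumes neither is all zeros)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


class SimilarityCache:
    """Hypothetical helper: remembers every pairwise similarity it computes."""

    def __init__(self, word_vectors):
        self.word_vectors = word_vectors  # dict: word -> 1-D NumPy array
        self._cache = {}                  # dict: (word, word) -> similarity

    def similarity(self, w1, w2):
        # Sort the pair so that (a, b) and (b, a) map to the same key.
        key = (w1, w2) if w1 <= w2 else (w2, w1)
        if key not in self._cache:
            self._cache[key] = cosine(self.word_vectors[w1], self.word_vectors[w2])
        return self._cache[key]

    def n_nearest_words(self, word, n):
        if word not in self.word_vectors:
            return []
        others = (w for w in self.word_vectors if w != word)
        # nlargest keeps only the n best candidates instead of sorting everything.
        return heapq.nlargest(n, others, key=lambda w: self.similarity(word, w))
```

Note that the cache only pays off when the same pairs are queried again. The first lookup for any word still compares it against every other word, and with a few million words a dict of all pairs would not fit in memory, so in practice the cache would need a size limit.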