Question

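I have had the gensim Word2Vec implementation compute some word embeddings for me. As far as I can tell, everything went well; now I am clustering the resulting word vectors, hoping to get some semantic groupings.

As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. That is, if I have the embedding vector [x, y, z], I would like to find out which actual word this vector represents. I can get the words/Vocab items by calling model.vocab and the word vectors through model.syn0, but I could not find a place where the two are explicitly matched.

This was more complicated than I expected, and I feel I might be missing the obvious way of doing it. Any help is appreciated!

Problem:

How do I match words to the embedding vectors created by Word2Vec()?

My approach:

After creating the model (code below*), I would like to match the index assigned to each word (during the build_vocab() phase) to the corresponding row of the vector matrix exposed as model.syn0. So: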

import numpy as np

for i in range(newmod.syn0.shape[0]):  # iterate over all rows (one per word) in the model
    # get the word out of the internal dictionary by its index
    word = next(k for k in newmod.vocab if newmod.vocab[k].index == i)
    wordvector = newmod.syn0[i]  # get the vector with the corresponding index
    # testing: compare with the result of looking up the word in the model -- this prints True
    print(i, np.array_equal(wordvector, newmod[word]))

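Is there a better way of doing this, e.g. by feeding the vector into the model to get the matching word back?
Does this even give me correct results?

*My code to create the word vectors: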

from gensim.models import Word2Vec

model = Word2Vec(size=1000, min_count=5, workers=4, sg=1)

model.build_vocab(sentencefeeder(folderlist)) #sentencefeeder puts out sentences as lists of strings

model.save("newmodel")

I found this question, which is similar but has not really been answered.

Answer 1

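I spent a long time looking for the mapping between the syn0 matrix and the vocabulary. Here is the answer: use model.index2word, which is simply the list of words in the right order!

This is not in the official documentation (why?), but it can be found directly in the source code: https://github.com/RaRe-Technologies/gensim/blob/3b9bb59dac0d55a1cd6ca8f984cead38b9cb0860/gensim/models/word2vec.py#L441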
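A minimal sketch of what that ordering gives you. The list and matrix below are made-up stand-ins for `model.index2word` and `model.syn0` (plain Python lists, so it runs without gensim); the only thing carried over from the answer is the assumption that row i of the matrix belongs to the i-th word in the list.

```python
# Toy stand-ins for model.index2word and model.syn0 (hypothetical values).
index2word = ["system", "user", "trees", "graph"]
syn0 = [
    [0.1, 0.3],  # row 0 -> "system"
    [0.2, 0.1],  # row 1 -> "user"
    [0.9, 0.8],  # row 2 -> "trees"
    [0.8, 0.9],  # row 3 -> "graph"
]

# Row index -> word is a plain list lookup.
word_at_row_2 = index2word[2]

# Word -> row index: invert the list once into a dict.
word2index = {word: i for i, word in enumerate(index2word)}
vector_for_graph = syn0[word2index["graph"]]

print(word_at_row_2)     # trees
print(vector_for_graph)  # [0.8, 0.9]
```

Building `word2index` once replaces the per-row scan over the vocabulary in the question's loop with a single dictionary lookup.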

Answer 2

I found an easy way to do this, where nmodel is the name of your model.

import numpy as np

# zip the two lists containing words and vectors; list() keeps the pairs reusable in Python 3
zipped = list(zip(nmodel.wv.index2word, nmodel.wv.syn0))

# the resulting list contains `(word, wordvector)` tuples. We can extract the entry for any
# `word` or `vector` (replace with the word/vector you're looking for) using a list comprehension:
wordresult = [i for i in zipped if i[0] == word]
vecresult = [i for i in zipped if np.array_equal(i[1], vector)]

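This is based on the gensim code. For older versions of gensim, you might need to drop the wv after the model name. (In gensim 4.0 and later, these attributes are called wv.index_to_key and wv.vectors.)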
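If many lookups are needed, building dictionaries once avoids rescanning the zipped list every time. The following is only a sketch under assumptions: `words` and `vectors` are toy stand-ins for the model's word list and vector matrix, and the reverse lookup works only for an exact copy of a stored vector (not, say, for a cluster centroid).

```python
# Toy stand-ins for the model's word list and vector matrix (hypothetical values).
words = ["human", "interface", "computer"]
vectors = [[0.5, 0.1], [0.2, 0.7], [0.9, 0.4]]

# word -> vector
word_to_vector = dict(zip(words, vectors))

# vector -> word: lists and NumPy arrays are not hashable,
# so each vector is converted to a tuple before being used as a key.
vector_to_word = {tuple(vec): word for word, vec in zip(words, vectors)}

looked_up_vector = word_to_vector["interface"]
looked_up_word = vector_to_word[tuple([0.9, 0.4])]

print(looked_up_vector)  # [0.2, 0.7]
print(looked_up_word)    # computer
```

`tuple(vec)` also accepts a one-dimensional NumPy row, so the same pattern should carry over to real model vectors.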

Answer 3

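If all you want is to map a word to its vector, you can simply use the [] operator, e.g. model["hello"] will give you the vector corresponding to hello.

If you need to recover a word from a vector, you could loop through your list of vectors and check for a match, as you propose. However, this is inefficient and not Pythonic. A convenient alternative is the similar_by_vector method of the word2vec model, like this: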

import gensim

documents = [['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

model = gensim.models.Word2Vec(documents, min_count=1)
print(model.similar_by_vector(model["survey"], topn=1))

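Which outputs:

[('survey', 1.0000001192092896)]

where the number is the similarity.

However, this method is still inefficient, because it has to scan all of the word vectors to find the most similar one. The best solution to your problem is to keep track of your vectors during the clustering process, so that you do not have to rely on expensive reverse mappings.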
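One way to keep track is to rely on row order: clustering routines typically return one label per input row, in input order, so label i belongs to word i. In this sketch, `index2word` and `labels` are invented stand-ins; the labels would come from whatever clustering you run on the vector matrix.

```python
from collections import defaultdict

# Toy stand-in for the model's word list, in matrix row order (hypothetical values).
index2word = ["human", "user", "trees", "graph", "system"]

# Hypothetical clustering output: one cluster label per row, in the same order.
labels = [0, 0, 1, 1, 0]

# Group the words by cluster label; no vector-to-word lookup is needed.
clusters = defaultdict(list)
for word, label in zip(index2word, labels):
    clusters[label].append(word)

for label in sorted(clusters):
    print(label, clusters[label])
```

This only holds if the rows are passed to the clustering step unshuffled and unfiltered; if you subsample, carry the row indices along with the vectors.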

Answer 4

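As @bpachev mentioned, gensim does have an option for searching by vector, namely similar_by_vector.

However, it implements a brute-force linear search: it computes the cosine similarity between the given vector and the vectors of all words in the vocabulary, and returns the top neighbours. An alternative, as mentioned in another answer, is to use an approximate nearest-neighbour search algorithm such as FLANN.

Here is a gist demonstrating this: https://gist.github.com/kampta/139f710ca91ed5fabaf9e6616d2c762b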
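For reference, the brute-force search described in this answer amounts to roughly the following. This is a pure-Python illustration of the idea, not gensim's actual implementation; the function name `most_similar_by_vector` and the toy data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_by_vector(query, words, vectors, topn=1):
    """Score every word against the query, then keep the topn best matches."""
    scored = [(word, cosine(query, vec)) for word, vec in zip(words, vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vocabulary and vectors (hypothetical values).
words = ["survey", "trees", "graph"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

result = most_similar_by_vector([0.9, 0.1], words, vectors, topn=2)
print([word for word, _ in result])  # ['survey', 'graph']
```

Each query touches every vector once, so the cost grows linearly with the vocabulary size; that is the cost approximate nearest-neighbour indexes try to avoid.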
