Question

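I have a word2vec model in gensim trained on 98892 documents. For any given sentence that is not present in the sentences array (i.e. the set on which I trained the model), I need to update the model with that sentence so that the next query returns some results. I am doing it like this: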

new_sentence = ['moscow', 'weather', 'cold']
model.train(new_sentence)

and this prints the following log:

2014-03-01 16:46:58,061 : INFO : training model with 1 workers on 98892 vocabulary and 100 features
2014-03-01 16:46:58,211 : INFO : reached the end of input; waiting to finish 1 outstanding jobs
2014-03-01 16:46:58,235 : INFO : training on 10 words took 0.1s, 174 words/s

Now, when I query most_similar with the same new_sentence as the positives (as in model.most_similar(positive=new_sentence)), it raises an error:

Traceback (most recent call last):
  File "<pyshell#220>", line 1, in <module>
    model.most_similar(positive=['moscow', 'weather', 'cold'])
  File "/Library/Python/2.7/site-packages/gensim/models/word2vec.py", line 405, in most_similar
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'cold' not in vocabulary"

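This indicates that the word 'cold' is not part of the vocabulary the model was trained on (am I right?).

So the question is: how do I update the model so that it returns all the possible similarities for a given new sentence?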

Answer 1

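train() expects a sequence of sentences as input, not a single sentence.
train() only updates the weights of the existing feature vectors, based on the existing vocabulary. You cannot add new vocabulary (= new feature vectors) using train().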
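
To illustrate the first point, here is a minimal pure-Python sketch of what happens when a corpus is consumed as "an iterable of sentences, each an iterable of tokens". The helper tokens_seen is hypothetical and not part of gensim; it only mimics that nested iteration.

```python
# Hypothetical helper (not a gensim function): flattens a corpus the way a
# "sequence of sentences, each a sequence of tokens" is iterated.
def tokens_seen(corpus):
    return [token for sentence in corpus for token in sentence]

new_sentence = ['moscow', 'weather', 'cold']

# Passed directly, each word is treated as a "sentence", so iterating it
# yields single characters instead of words.
flat = tokens_seen(new_sentence)

# Wrapped in an outer list, it is one sentence made of three word tokens.
wrapped = tokens_seen([new_sentence])

print(flat[:6])
print(wrapped)
```

So the call in the question would most likely need to be model.train([new_sentence]), although that alone still would not add 'cold' to the vocabulary.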

Answer 2

As of gensim 0.13.3 it is possible to do online training of Word2Vec with gensim.

model.build_vocab(new_sentences, update=True)
model.train(new_sentences)

Answer 3

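If your model was generated by the C tool and loaded with load_word2vec_format(), it cannot be updated. See the section on online training in the word2vec tutorial:

Note that it's not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.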

Answer 4

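First of all, you cannot add new words to a pre-trained model.

However, the "new" doc2vec model published in 2014 meets all your requirements. You can use it to train document vectors directly, instead of getting a set of word vectors and then combining them. The best part is that doc2vec can infer vectors for unseen sentences after training. The model itself still cannot be changed, but in my experiments the inference results were pretty good.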

Answer 5

The problem is that you cannot retrain a word2vec model with new sentences. Only doc2vec allows that. Try the doc2vec model.

Answer 6

You can add to the model vocabulary, and add to the embeddings, using FastText.


Here you can see some FastText examples. Here you can see how to use FastText to score out-of-vocabulary (OOV) instances.

Updating a gensim word2vec model
