Question

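I am training a Doc2Vec model on Wikipedia. I do not have enough memory to train the model in one pass: when I try to build the vocabulary from all the sentences, my Python process crashes.

So I want to split the process into parts: select a few documents, train the model, save it, then load the saved model and try to update it with new sentences/labels.

My code for the first training run: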

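import random
import gensim

model = gensim.models.Doc2Vec(min_count=5, window=10, size=300, sample=1e-3, negative=5, workers=3)

# Build the vocabulary once from the first batch of labeled sentences.
sentences_list = sentences.to_array()
model.build_vocab(sentences_list)

# list(...) gives random.shuffle a mutable sequence to reorder in place.
Idx = list(range(len(sentences_list)))

for epoch in range(10):
    # Present the sentences in a different order on every pass.
    random.shuffle(Idx)
    perm_sentences = [sentences_list[i] for i in Idx]
    model.train(perm_sentences)

model.save('example')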

This code works perfectly. After that I did:

# Reload the saved model and continue training on a new batch.
model = gensim.models.Doc2Vec.load('example')

sentences_list_new = sentences_new.to_array()
Idx = list(range(len(sentences_list_new)))

for epoch in range(10):
    random.shuffle(Idx)
    perm_sentences_new = [sentences_list_new[i] for i in Idx]
    model.train(perm_sentences_new)

But I get a warning:

WARNING:gensim.models.word2vec:supplied example count (9999) did not equal expected count (133662)

New words are not added to the model.

Then I tried to build the vocabulary with the new words:

model.build_vocab(sentences_list_new)

But I get this error:

RuntimeError: must sort before initializing vectors/weights

But... after this, the new words do appear in the vocabulary.

Where is the problem?

Answer 1

From Gordon Mohr's answer here:

Currently, the model discovers vocabulary only once [...] is no longer supported.
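The practical effect of a vocabulary that is discovered only once can be illustrated with a toy sketch in plain Python. This is not gensim's code: `build_fixed_vocab` and `keep_known` are hypothetical helpers, and the documents are made up. It only shows that tokens absent when the vocabulary was built have nowhere to go in a later batch.

```python
from collections import Counter


def build_fixed_vocab(documents, min_count=1):
    """Count tokens once and keep those seen at least min_count times."""
    counts = Counter(token for doc in documents for token in doc)
    kept = sorted(t for t, c in counts.items() if c >= min_count)
    return {token: index for index, token in enumerate(kept)}


def keep_known(document, vocab):
    """Drop tokens that were not present when the vocabulary was built."""
    return [token for token in document if token in vocab]


# Hypothetical first batch: the vocabulary is fixed after this step.
first_batch = [["doc2vec", "trains", "vectors"], ["vectors", "need", "memory"]]
vocab = build_fixed_vocab(first_batch)

# A later document: "wikipedia" and "updating" were never seen, so they are dropped.
new_doc = ["wikipedia", "vectors", "need", "updating"]
print(keep_known(new_doc, vocab))
```

Under this assumption, the second batch in the question can only adjust vectors for words the first batch already contained.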

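According to sebastien-j in this discussion:

Memory usage should be about 8 * size * |V| bytes (plus some overhead).

For |V| = 10^7 and size = 500, this is 40 GB.

See whether your system has enough memory (if it does, there may be a Python version issue, which is unlikely in your case...).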

If it does not, you can try increasing min_count (set before calling build_vocab()) to shrink the vocabulary.
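The memory estimate quoted above is easy to check before committing to a long run. The sketch below only evaluates that rule of thumb; `estimate_bytes` is a hypothetical helper, the factor 8 is taken directly from the quoted formula, and overhead is ignored.

```python
def estimate_bytes(vocab_size, vector_size, bytes_per_dimension=8):
    """Rough model size in bytes: 8 * size * |V|, excluding overhead."""
    return bytes_per_dimension * vector_size * vocab_size


# The figures from the quote: |V| = 10^7 and size = 500.
print(estimate_bytes(10**7, 500) / 1e9, "GB")

# The question's size=300 with a hypothetical vocabulary of 2 million words.
print(estimate_bytes(2 * 10**6, 300) / 1e9, "GB")
```

Because the estimate is linear in both |V| and size, halving either one halves the footprint.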
