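I have a Doc2Vec model in gensim that I train using the build_vocab_from_freq() function, so that I can manually include a <PAD> token at index 0. This token does not appear in the original dataset, but I need it later in my program.
Here is a simple example of what I am trying to achieve: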
import collections, sys
import gensim
from gensim import models
from gensim.models.doc2vec import TaggedDocument

lines = [u'It is a truth universally acknowledged',
         u'This was invitation enough.',
         u'An invitation to dinner was soon afterwards dispatched']
words = [line.split() for line in lines]
doc_labels = [u'text0', u'text1', u'text2']

word_freq = collections.Counter([w for line in words for w in line])
word_freq['<PAD>'] = sys.maxint  # this ensures that the pad token has index 0 in gensim's vocabulary

class DocIterator(object):
    def __init__(self, docs, labels):
        self.docs = docs
        self.labels = labels

    def __iter__(self):
        # yield one TaggedDocument per document, tagged with its label
        for idx, doc in enumerate(self.docs):
            yield TaggedDocument(words=doc, tags=[self.labels[idx]])

doc_it = DocIterator(words, doc_labels)
model = gensim.models.Doc2Vec(vector_size=100, min_count=0)
model.build_vocab_from_freq(word_freq)
model.train(doc_it, total_examples=len(lines), epochs=10)
Expected result: model.docvecs.count should be 3 (not 0).
Actual result: model.docvecs.count is 0:
print(model.docvecs.count)
-> 0
Linux-3.19.0-82-generic-x86_64-with-Ubuntu-15.04-vivid
('Python', '2.7.9 (default, Apr 2 2015, 15:33:21) \n[GCC 4.9.2]')
('NumPy', '1.14.3')
('SciPy', '1.1.0')
('gensim', '3.4.0')
('FAST_VERSION', 1)
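Now my questions are:
- What is the right way to get a valid model when using build_vocab_from_freq()?
- To that end, what is the best way to force gensim to include an unseen token at a specific index in the vocabulary?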
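
For reference, here is a minimal sketch in plain Python (no gensim involved) of the idea behind the sentinel count for <PAD>. It assumes that the vocabulary is ordered by descending frequency, which is my understanding of how gensim sorts it; the names ordered and word_to_index are only illustrative and are not gensim attributes. It uses sys.maxsize, the Python 3 counterpart of sys.maxint.

```python
import collections
import sys

lines = [
    "It is a truth universally acknowledged",
    "This was invitation enough.",
    "An invitation to dinner was soon afterwards dispatched",
]
words = [line.split() for line in lines]

# Count every token, then give the unseen <PAD> token the largest possible count.
word_freq = collections.Counter(w for doc in words for w in doc)
word_freq["<PAD>"] = sys.maxsize  # Python 3 counterpart of sys.maxint

# Assumption: indices are assigned in order of descending frequency.
# How gensim orders words with equal counts is not modelled here.
ordered = sorted(word_freq, key=word_freq.get, reverse=True)
word_to_index = {word: idx for idx, word in enumerate(ordered)}

print(word_to_index["<PAD>"])  # prints 0
```

This only shows why the largest count ends up at index 0 under that ordering assumption. It says nothing about how the document tags are registered, which is the part that fails in my example.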