Using build_vocab_from_freq()

Time: 2018-06-05 12:28:17

Tags: python-2.7 gensim

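I have a Doc2Vec model in gensim whose vocabulary I build with the build_vocab_from_freq() function. This lets me manually include a <PAD> token at index 0. The token does not appear in the original dataset, but I need it later in my program.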

Steps/code/corpus to reproduce

Here is a simple example of what I want to achieve:

import collections, sys

import gensim
from gensim import models
from gensim.models.doc2vec import TaggedDocument

lines = [u'It is a truth universally acknowledged',  
        u'This was invitation enough.',  
        u'An invitation to dinner was soon afterwards dispatched']  
words = [line.split() for line in lines]  
doc_labels = [u'text0', u'tex1', u'text2']  
word_freq = collections.Counter([w for line in words for w in line])  
word_freq['<PAD>'] = sys.maxint  # this ensures that the pad token has index 0 in gensim's vocabulary

class DocIterator(object):  
    def __init__(self, docs, labels):  
        self.docs = docs  
        self.labels = labels  
    def __iter__(self):  
        for idx, doc in enumerate(self.docs):  
            yield TaggedDocument(words=doc, tags=[self.labels[idx]])  

doc_it = DocIterator(words, doc_labels)  
model = gensim.models.Doc2Vec(vector_size=100, min_count=0)  
model.build_vocab_from_freq(word_freq)  
model.train(doc_it, total_examples=len(lines), epochs=10)  

The expected value of model.docvecs.count is 3 (not 0). The actual value of model.docvecs.count is 0.

print(model.docvecs.count)  # -> 0
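The expected value of 3 is the number of distinct document tags in the corpus. The sketch below counts them in plain Python without calling gensim; the (words, tags) tuples are a stand-in for TaggedDocument, and the names tagged_docs and expected_count are introduced here for illustration only. Note that word_freq in the example above contains only word counts, so one plausible reading (an assumption, not confirmed by the question) is that no tags ever reach the model when only build_vocab_from_freq() is called.

```python
lines = [
    "It is a truth universally acknowledged",
    "This was invitation enough.",
    "An invitation to dinner was soon afterwards dispatched",
]
doc_labels = ["text0", "tex1", "text2"]  # labels copied as-is from the question

# Stand-in for TaggedDocument: one (words, tags) pair per document.
tagged_docs = [(line.split(), [label]) for line, label in zip(lines, doc_labels)]

# Collect every distinct tag that appears in the corpus.
distinct_tags = {tag for _, tags in tagged_docs for tag in tags}

# This is the document count the question expects the model to report.
expected_count = len(distinct_tags)
print(expected_count)
```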

Versions
Linux-3.19.0-82-generic-x86_64-with-Ubuntu-15.04-vivid
('Python', '2.7.9 (default, Apr  2 2015, 15:33:21) \n[GCC 4.9.2]')
('NumPy', '1.14.3')
('SciPy', '1.1.0')
('gensim', '3.4.0')
('FAST_VERSION', 1)

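Now my questions are:
- What is the correct way to obtain a valid model when using build_vocab_from_freq()?
- What is the best way to force gensim to include an unseen token at a specific index in the vocabulary?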
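The idea behind the sys.maxint trick can be sketched in plain Python, assuming the vocabulary is ordered by descending frequency (gensim's default sorted-vocabulary behaviour, as far as I know). The helper index_by_frequency is hypothetical and not part of gensim; it only mimics that ordering, with alphabetical tie-breaking added here so the result is deterministic. On Python 3, sys.maxsize takes the place of sys.maxint.

```python
import collections
import sys

lines = [
    "It is a truth universally acknowledged",
    "This was invitation enough.",
    "An invitation to dinner was soon afterwards dispatched",
]
word_freq = collections.Counter(w for line in lines for w in line.split())

# sys.maxsize is the Python 3 counterpart of sys.maxint used in the question.
word_freq["<PAD>"] = sys.maxsize


def index_by_frequency(freq):
    """Hypothetical helper (not a gensim API): rank words by descending count.

    Ties are broken alphabetically, which is an assumption made only to keep
    this sketch deterministic.
    """
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _count) in enumerate(ranked)}


word_index = index_by_frequency(word_freq)
print(word_index["<PAD>"])
```

Because <PAD> has the largest count by construction, it is ranked first under any descending-frequency ordering, which is why the question relies on it landing at index 0.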

0 answers:

No answers yet