Gensim Doc2Vec trims and deletes the vocabulary

Date: 2018-05-28 14:59:17

Tags: python gensim doc2vec vocabulary

I am trying to create a simple Doc2Vec model:

 from gensim.models import doc2vec
 from gensim.models.doc2vec import Doc2Vec

 # Five tiny tagged documents; no word occurs more than four times overall.
 sentences = []
 sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'rosse', u'con', u'tacco'], tags=[1]))
 sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'blu'], tags=[2]))
 sentences.append(doc2vec.TaggedDocument(words=[u'scarponcini', u'Emporio', u'Armani'], tags=[3]))
 sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'marca', u'italiana'], tags=[4]))
 sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'bianche', u'senza', u'tacco'], tags=[5]))

 model = Doc2Vec(alpha=0.025, min_alpha=0.025)  # use fixed learning rate
 model.build_vocab(sentences)

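But I end up with an empty vocabulary. With some debugging, I found that inside build_vocab() a dictionary is actually created by vocabulary.scan_vocab(), but it is then emptied by the subsequent vocabulary.prepare_vocab() call. Digging deeper, this is the function that causes the problem: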

def keep_vocab_item(word, count, min_count, trim_rule=None):
    """Check that should we keep `word` in vocab or remove.

    Parameters
    ----------
    word : str
        Input word.
    count : int
        Number of times that word contains in corpus.
    min_count : int
        Frequency threshold for `word`.
    trim_rule : function, optional
        Function for trimming entities from vocab, default behaviour is `vocab[w] <= min_reduce`.

    Returns
    -------
    bool
        True if `word` should stay, False otherwise.

    """
    default_res = count >= min_count

    if trim_rule is None:
        return default_res # <-- ALWAYS RETURNS FALSE
    else:
        rule_res = trim_rule(word, count, min_count)
        if rule_res == RULE_KEEP:
            return True
        elif rule_res == RULE_DISCARD:
            return False
        else:
            return default_res  

Does anyone understand this problem?

1 answer:

Answer 0 (score: 2)

I found the answer myself: the default value of min_count is 5, and none of my words occurs 5 or more times, so keep_vocab_item() returns False for every word. I only had to change the line that constructs the model so that it passes a lower min_count to Doc2Vec.
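
To see why the vocabulary ends up empty, the default rule (count >= min_count) can be reproduced with the standard library alone. This is only a minimal sketch: surviving_vocab is a hypothetical helper written for illustration, not a gensim function, and it mirrors only the trim_rule=None branch of keep_vocab_item() quoted in the question.

```python
from collections import Counter

# The five documents from the question, as plain token lists.
documents = [
    ["scarpe", "rosse", "con", "tacco"],
    ["scarpe", "blu"],
    ["scarponcini", "Emporio", "Armani"],
    ["scarpe", "marca", "italiana"],
    ["scarpe", "bianche", "senza", "tacco"],
]


def surviving_vocab(docs, min_count):
    """Hypothetical helper: keep words whose corpus count is >= min_count."""
    counts = Counter(word for doc in docs for word in doc)
    return {word: n for word, n in counts.items() if n >= min_count}


# With the default threshold of 5, even the most frequent word is dropped.
print(surviving_vocab(documents, min_count=5))

# With a threshold of 1, every distinct word is kept.
print(sorted(surviving_vocab(documents, min_count=1)))
```

The most frequent word, 'scarpe', appears only four times, so nothing passes a threshold of 5. With gensim itself, the corresponding change is to pass a smaller min_count keyword argument to the Doc2Vec constructor before calling build_vocab().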