Question

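According to the original paper, Distributed Representations of Sentences and Documents, inference on an unseen paragraph works as follows:

"The inference stage" obtains paragraph vectors D for new paragraphs (never seen before) by adding more columns to D and running gradient descent on D while holding W, U, and b fixed.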
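To make the quoted step concrete, here is a minimal pure-Python sketch of gradient descent on a single new paragraph vector while W, U, and b stay frozen. It is a simplification under stated assumptions: it averages the paragraph vector with the context word vectors, uses a full softmax, and the function name `infer_paragraph_vector` and its parameters are hypothetical. It is not gensim's implementation.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def infer_paragraph_vector(word_ids, W, U, b, window=2, epochs=50, lr=0.1, seed=0):
    """Hypothetical helper: fit one new paragraph vector d by SGD.

    W (word vectors), U and b (softmax weights and biases) are only read,
    never updated. Returns the vector and the summed loss for each epoch.
    """
    rng = random.Random(seed)
    dim = len(W[0])
    d = [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for pos, target in enumerate(word_ids):
            # Context is clipped at the paragraph edges; no padding is added.
            lo = max(0, pos - window)
            hi = min(len(word_ids), pos + window + 1)
            context = [word_ids[i] for i in range(lo, hi) if i != pos]
            n = len(context) + 1
            # Hidden layer: average of the paragraph vector and context word vectors.
            h = [(d[k] + sum(W[c][k] for c in context)) / n for k in range(dim)]
            scores = [sum(U[v][k] * h[k] for k in range(dim)) + b[v]
                      for v in range(len(U))]
            probs = softmax(scores)
            epoch_loss -= math.log(probs[target])
            # Gradient of the cross-entropy loss with respect to d only.
            err = list(probs)
            err[target] -= 1.0
            grad_d = [sum(err[v] * U[v][k] for v in range(len(U))) / n
                      for k in range(dim)]
            d = [d[k] - lr * grad_d[k] for k in range(dim)]
        losses.append(epoch_loss)
    return d, losses

# Toy "trained" model: 3 vocabulary words, 4 dimensions (random values for illustration).
toy_rng = random.Random(1)
toy_W = [[toy_rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
toy_U = [[toy_rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
toy_b = [0.0, 0.0, 0.0]
vector, history = infer_paragraph_vector([0, 1, 2], toy_W, toy_U, toy_b)
print(history[0], history[-1])
```

Only `d` is assigned inside the loop, which is what "holding W, U, b fixed" amounts to in this sketch.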

This inference stage can be run in gensim with infer_vector(). Suppose I have a doc2vec model with window = 5 and try to infer a paragraph in which some sentences have len(sentence) < 5.

For example:

    model = Doc2Vec(window=5)
    paragraph = [['I', 'am', 'groot'], ['I', 'am', 'groot', 'I', 'am', 'groot']]
    model.infer_vector(paragraph)

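In this case, should I pad the input to infer_vector() with a special NULL word symbol so that every sentence in the paragraph is at least as long as the window size?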

For example:

    paragraph = [['I', 'am', 'groot', NULL, NULL], ['I', 'am', 'groot', 'I', 'am', 'groot']]

Answer 1

You never need to do any explicit padding.

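In the default and common Doc2Vec modes, if there is not enough context on either side of a focus word, the effective window simply shrinks on that side to match what is available.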
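The shrinking behaviour can be illustrated with a small pure-Python sketch. The helper `context_windows` is hypothetical and only mimics the edge clipping described above. It ignores any random window reduction the library may also apply, and it does not call gensim.

```python
def context_windows(tokens, window):
    """Return (focus, context) pairs, clipping the window at the text edges."""
    pairs = []
    for pos, focus in enumerate(tokens):
        start = max(0, pos - window)                # cannot reach before the first token
        end = min(len(tokens), pos + window + 1)    # cannot reach past the last token
        context = tokens[start:pos] + tokens[pos + 1:end]
        pairs.append((focus, context))
    return pairs

# A three-token text with window=5: every focus word just sees the other two tokens.
for focus, context in context_windows(["I", "am", "groot"], window=5):
    print(focus, context)
```

With a text shorter than the window, each focus word ends up with whatever neighbours exist, so no NULL tokens are required for the loop to work.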

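(In the non-default dm=1, dm_concat=1 mode, padding is applied automatically when necessary. But this mode produces larger, slower models that need far more data to train, and its value is not very clear in any proven setting. It is unlikely to give good results except for advanced users who have a lot of data and are able to tinker with non-default parameters.)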

Answer 2

I found that gensim automatically pads documents during both training and inference, at least in the dm_concat code path:

gensim.models.doc2vec.train_document_dm_concat

    null_word = model.vocab['\0']
    pre_pad_count = model.window
    post_pad_count = model.window
    padded_document_indexes = (
        (pre_pad_count * [null_word.index])  # pre-padding
        + [word.index for word in word_vocabs if word is not None]  # elide out-of-Vocabulary words
        + (post_pad_count * [null_word.index])  # post-padding
    )
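The effect of this padding can be sketched in pure Python. In a concatenating mode the input layer has a fixed width, so every focus word needs exactly 2 * window neighbours; padding both ends with a null token guarantees that. The helper `padded_contexts` and the constant `NULL_WORD` are hypothetical names for illustration. The sketch works on token strings rather than vocabulary indexes and is not gensim's code.

```python
NULL_WORD = "\0"  # stand-in for the model's null word; an assumption for this sketch

def padded_contexts(tokens, window, null_word=NULL_WORD):
    """Return (focus, context) pairs where every context has 2 * window items."""
    padded = [null_word] * window + list(tokens) + [null_word] * window
    contexts = []
    # Visit only the real tokens, which now sit at positions window .. len - window - 1.
    for pos in range(window, len(padded) - window):
        before = padded[pos - window:pos]
        after = padded[pos + 1:pos + 1 + window]
        contexts.append((padded[pos], before + after))
    return contexts

for focus, context in padded_contexts(["I", "am", "groot"], window=2):
    print(repr(focus), context)
```

Because the padding is added inside the helper, the caller passes the plain token list, which matches the point that no explicit padding is needed on the user's side.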

Does doc2vec (gensim) infer_vector need sentences padded to the window size?
