Why does gensim Doc2Vec give me different vectors for the same sentence?

Asked: 2016-08-16 22:27:48

Tags: python neural-network gensim

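I am training Doc2Vec (from gensim.models.doc2vec import Doc2Vec) on two identical sentences (documents), and when I inspect the vector for each sentence, they are completely different. Does the neural network use a different random initialization for each sentence?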

# imports
from gensim.models.doc2vec import LabeledSentence
from gensim.models.doc2vec import Doc2Vec
from gensim import utils

# Document iteration class (turns many documents into sentences,
# each document being one sentence)
class LabeledDocs(object):
    def __init__(self, sources):
        self.sources = sources
        flipped = {}
        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                # print fin.read().strip(r"\n")
                yield LabeledSentence(utils.to_unicode(fin.read()).split(),
                                      [prefix])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                #print fin, fin.read()
                self.sentences.append(
                    LabeledSentence(utils.to_unicode(fin.read()).split(),
                                    [prefix]))
        return self.sentences

# play and play3 are names of identical documents (diff gives nothing)
inp = LabeledDocs({"play":"play", "play3":"play3"})
model = Doc2Vec(size=20, window=8, min_count=2, workers=1, alpha=0.025,
                min_alpha=0.025, batch_words=1)
model.build_vocab(inp.to_array())
for epoch in range(10):
    model.train(inp)

# after this, model.docvecs["play"] is very different from
# model.docvecs["play3"]

Why is that? Both play and play3 contain:

foot ball is a sport
played with a ball where
teams of 11 each try to
score on different goals
and play with the ball

1 Answer:

Answer 0 (score: 2)

Yes, each sentence vector is initialized differently.

Specifically, this happens in the reset_weights method. The code that randomly initializes the sentence vectors is:

for i in xrange(length):
    # construct deterministic seed from index AND model seed
    seed = "%d %s" % (model.seed, self.index_to_doctag(i))
    self.doctag_syn0[i] = model.seeded_vector(seed)

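Here you can see that each sentence vector is initialized from the model's random seed combined with the sentence's tag. So it makes sense that, in your example, play and play3 get different initial vectors.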
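To make the effect of the seed string concrete, here is a minimal stdlib-only sketch. It is not gensim's implementation: the `seeded_vector` function below is a hypothetical stand-in that hashes the string with `zlib.crc32` and seeds a private `random.Random`, and the model seed value of 1 is an assumption for illustration.

```python
import random
import zlib


def seeded_vector(seed_string, size):
    """Return a small pseudo-random vector determined entirely by seed_string."""
    # Hypothetical stand-in for the library helper: hash the string to an
    # integer, seed a private RNG with it, and draw `size` small values.
    rng = random.Random(zlib.crc32(seed_string.encode("utf-8")))
    return [(rng.random() - 0.5) / size for _ in range(size)]


model_seed = 1  # assumed value, for illustration only

# Same pattern as the snippet above: "<model seed> <doc tag>"
vec_play = seeded_vector("%d %s" % (model_seed, "play"), 20)
vec_play3 = seeded_vector("%d %s" % (model_seed, "play3"), 20)
vec_play_again = seeded_vector("%d %s" % (model_seed, "play"), 20)

print(vec_play == vec_play_again)  # same seed and tag: identical start
print(vec_play == vec_play3)       # different tag: different start
```

The starting point depends only on the seed and the tag, never on the document's text, so two identical documents with different tags begin training from different vectors.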

However, if the model is trained properly, I would expect the two vectors to end up very close to each other.
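One way to check how close the two trained vectors are is cosine similarity. The sketch below is plain Python with made-up numbers standing in for the two document vectors; with a trained model you would pass in the vectors for "play" and "play3" instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Illustrative values only, not output from a real model.
vec_a = [0.12, -0.40, 0.33]
vec_b = [0.10, -0.42, 0.35]

print(cosine_similarity(vec_a, vec_b))
```

A value near 1 means the vectors point in nearly the same direction, which is what I would hope to see for two identical documents after enough training.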