Question

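Background:

I built a skip-gram word2vec model using negative sampling. After reading the paper, I found that doc2vec looks very similar to word2vec, except that instead of using the target word to predict the context words, we use the paragraph ID to predict the context words (if I remember correctly).

What I have so far: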

from keras.layers import Input, Embedding, Reshape, Dense, dot
from keras.models import Model

input_paragraph_id = Input((1,))
input_context = Input((1,))
embedding = Embedding(self.vocabulary_size, self.vector_dim, input_length=1, name='embedding')
target = embedding(input_paragraph_id)
# calling the Reshape layer on `target` connects it to the embedding output
target = Reshape((self.vector_dim, 1))(target)
context = embedding(input_context)
context = Reshape((self.vector_dim, 1))(context)
# now perform the dot product operation to get a similarity measure
dot_product = dot([target, context], normalize=False, axes=1)
dot_product = Reshape((1,))(dot_product)
# add the sigmoid output layer
output = Dense(1, activation='sigmoid')(dot_product)
# create the primary training model
model = Model(inputs=[input_paragraph_id, input_context], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='rmsprop')


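My vocabulary size is 10000 + the number of documents. Within the vocabulary, the first 10000 indices are the words in the sentences and the following n indices are the paragraph IDs. So I use a single embedding (matrix) for both words and paragraphs, and rely on the vocabulary index to look up the right embedding.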
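The shared index space described above can be sketched as follows. This is only an illustration: the names `NUM_WORDS`, `word_index`, `paragraph_index` and `shared_vocabulary_size` are hypothetical, and the only thing taken from the post is that words occupy the first 10000 rows and paragraph IDs come after them.

```python
NUM_WORDS = 10000  # word rows occupy indices 0 .. NUM_WORDS - 1 (from the post)


def word_index(word_id: int) -> int:
    """Return the row of the shared embedding matrix for a word."""
    if not 0 <= word_id < NUM_WORDS:
        raise ValueError(f"word id {word_id} is outside 0..{NUM_WORDS - 1}")
    return word_id


def paragraph_index(paragraph_id: int) -> int:
    """Return the row of the shared embedding matrix for a paragraph.

    Paragraph rows are placed directly after all word rows.
    """
    if paragraph_id < 0:
        raise ValueError("paragraph id must be non-negative")
    return NUM_WORDS + paragraph_id


def shared_vocabulary_size(num_paragraphs: int) -> int:
    """Total number of rows the single Embedding layer would need."""
    return NUM_WORDS + num_paragraphs
```

With this layout, the first paragraph maps to row 10000, so a word index and a paragraph index can never collide.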

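My input will be similar to skip-gram, where the target is the paragraph ID and the context is a word from the sentence.

Let's take this sentence as the corpus:

children are running in the park

My X and y are as follows: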

(X1,X2), y

(p01,children), 1
(p01,are), 1
(p01,running), 1
(p01,in), 1
(p01,the), 1
(p01,park), 1

My y vector is all 1s, because I have not introduced any noise (negative) samples.
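For comparison, this is roughly what the pairs would look like once labels of 0 are added. It is a minimal sketch, not the paper's procedure: it assumes negatives are drawn uniformly from the word indices (word2vec itself draws them from a smoothed unigram distribution), and `make_dbow_pairs` and its parameters are hypothetical names.

```python
import random


def make_dbow_pairs(paragraph_id, word_ids, num_words, num_negative=2, seed=0):
    """Build ((paragraph_id, word_id), label) samples for one paragraph.

    Each word in the paragraph yields one positive pair (label 1) and
    `num_negative` pairs with random words not in the paragraph (label 0).
    """
    rng = random.Random(seed)
    in_paragraph = set(word_ids)
    if len(in_paragraph) >= num_words:
        raise ValueError("no words left to use as negatives")

    samples = []
    for word_id in word_ids:
        samples.append(((paragraph_id, word_id), 1))
        drawn = 0
        while drawn < num_negative:
            candidate = rng.randrange(num_words)  # assumption: uniform noise
            if candidate in in_paragraph:
                continue  # skip words that really occur in the paragraph
            samples.append(((paragraph_id, candidate), 0))
            drawn += 1
    return samples
```

Calling `make_dbow_pairs(100, [0, 1, 2], num_words=50)` gives three positive pairs and six negative ones, all sharing the paragraph index 100.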

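Question:

Is this setup appropriate for DBOW? Or should I have separate embedding layers for the words and the paragraph IDs (our W matrix and P matrix)? If this is incorrect, can anyone point me in the right direction for implementing doc2vec with negative sampling?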
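To make the "separate W and P matrices" alternative concrete, here is a minimal pure-Python sketch of one SGD step on the same sigmoid/binary cross-entropy loss, with paragraph vectors and word vectors kept in two different matrices. It is an assumption-laden illustration, not a claim about how the paper or any library implements DBOW; `init_matrix`, `sgd_step` and the learning rate are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_matrix(rows, dim, rng):
    """Create a rows x dim matrix of small random values."""
    return [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in range(rows)]


def sgd_step(P, W, paragraph_id, word_id, label, lr=0.1):
    """Update P[paragraph_id] and W[word_id] in place on one labelled pair.

    The prediction is sigmoid(P[paragraph_id] . W[word_id]) and the loss is
    binary cross-entropy. Returns the loss measured before the update.
    """
    p, w = P[paragraph_id], W[word_id]
    score = sigmoid(sum(a * b for a, b in zip(p, w)))
    grad = score - label  # derivative of the loss w.r.t. the dot product
    for i in range(len(p)):
        p_i, w_i = p[i], w[i]  # keep old values so both updates use them
        p[i] = p_i - lr * grad * w_i
        w[i] = w_i - lr * grad * p_i
    eps = 1e-12  # avoids log(0)
    return -(label * math.log(score + eps) + (1 - label) * math.log(1 - score + eps))


rng = random.Random(0)
P = init_matrix(rows=2, dim=4, rng=rng)   # one row per paragraph
W = init_matrix(rows=5, dim=4, rng=rng)   # one row per word
```

Repeating `sgd_step(P, W, 0, 1, 1)` pushes the paragraph vector and the word vector toward each other, while a label of 0 pushes them apart; the two matrices never share rows.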

doc2vec, negative sampling
