I want to implement a simple word2vec model, but I'm getting the following error:
ValueError: Error when checking target: expected dense-softmax to have 3 dimensions, but got array with shape (32, 14).
The variables train_x and train_y each contain 32 rows of the form
[[0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0]
...]]
The Python code is as follows:
vocal_size = 14
input = Input(shape=(vocal_size, ), dtype='int32', name='input')
embeddings = Embedding(output_dim=5, input_dim= vocal_size)(input)
output = Dense(vocal_size, use_bias=False, activation='softmax')(embeddings)
model = Model(input=input, output=output)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()
model.fit(train_x, train_y)
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input (InputLayer) (None, 14) 0
_________________________________________________________________
embeddings (Embedding) (None, 14, 5) 70
_________________________________________________________________
dense_1 (Dense) (None, 14, 14) 70
=================================================================
Total params: 140
Trainable params: 140
Non-trainable params: 0
Edit:
For the sentence ("I like stackoverflow") with a context size of 1, I created the following tuples:
("I", "like"), ("like", "I"), ("like", "stackoverflow"), ("stackoverflow", "like")
Then I one-hot encoded all of them and fed them to the model.
train_x[0] -> is the one-hot encoding of the word "I"
train_y[0] -> is the one-hot encoding of the context word "like"
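In code, the pair construction and one-hot encoding look roughly like this (a sketch of my setup; names like word_to_id are just for illustration):

import numpy as np

sentence = ['I', 'like', 'stackoverflow']
word_to_id = {w: i for i, w in enumerate(sorted(set(sentence)))}
vocab_size = len(word_to_id)

pairs = []
for i, word in enumerate(sentence):
    for j in (i - 1, i + 1):  # context window of size 1
        if 0 <= j < len(sentence):
            pairs.append((word, sentence[j]))
# pairs == [('I', 'like'), ('like', 'I'), ('like', 'stackoverflow'), ('stackoverflow', 'like')]

train_x = np.zeros((len(pairs), vocab_size))
train_y = np.zeros((len(pairs), vocab_size))
for k, (center, context) in enumerate(pairs):
    train_x[k, word_to_id[center]] = 1   # one-hot for the center word
    train_y[k, word_to_id[context]] = 1  # one-hot for the context word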
Edit 2:
A first encoding using skip-gram:
Treat 0 as a special word (i.e., one that is not among the 10,000 most common) and start counting from 1.
I assume I should feed in a number and get a one-hot encoding out, i.e., for ("stack", "overflow"): input [3] ("stack") and output [0,0,0,0,1,0,0,0,0,0,0] ("overflow").
Input(shape=(1,), ...) ->
Embedding(output_dim=embedding_size, input_dim=vocab_size, mask_zero=True, ...) ->
Dense(vocab_size+1, activation="softmax")
model.compile(optimizer='SGD', loss='categorical_crossentropy')
i.e., embedding_size = 5, with the sentences from your example as input.
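A runnable version of this sketch might look as follows. Note two assumptions of mine: mask_zero is dropped because Flatten does not support masking, and the Embedding's input_dim is vocab_size + 1 so the reserved id 0 fits in:

from keras.layers import Dense, Embedding, Flatten, Input
from keras import Model

vocab_size = 10       # assumed vocabulary size (0 is reserved for the "special" word)
embedding_size = 5

inp = Input(shape=(1,), dtype='int32')
emb = Embedding(output_dim=embedding_size, input_dim=vocab_size + 1,
                input_length=1)(inp)   # output shape: (None, 1, 5)
flat = Flatten()(emb)                  # (None, 1, 5) -> (None, 5)
out = Dense(vocab_size + 1, activation='softmax')(flat)

model = Model(inputs=inp, outputs=out)
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.summary()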
Answer (score 0):
Thanks for the edit. You're having trouble for two reasons, one shallow and one deep. First, the shallow one: the Embedding layer's output is three-dimensional, so the Dense layer on top of it also produces a three-dimensional output, while your target array is two-dimensional. You can fix the shape with Flatten:
input = Input(shape=(vocal_size, ), dtype='int32', name='input')
embeddings = Embedding(output_dim=5, input_dim=vocal_size+1, input_length=vocal_size)(input)
flat = Flatten()(embeddings)  # Flatten is a layer: instantiate it, then call it on the tensor
output = Dense(vocal_size, use_bias=False, activation='softmax')(flat)
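Compiled as in the question (same optimizer and loss, my assumption), the summary confirms the fix:

model = Model(inputs=input, outputs=output)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()  # the final Dense now reports output shape (None, 14), matching the (32, 14) targets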
The deep reason is that one-hot encoding and embedding layers are two alternatives for the same purpose, so you don't need both (see here and here).
An Embedding layer expects a series of "sentences" made up of integers that represent words (or tuples), together with a vocabulary size. So sentences like
['Welcome to stack overflow',
'stack overflow is great',
"Hope it's helpful to you"]
would be represented as
[[1,2,3,4,0],[3,4,5,6,0],[7,8,9,2,10]]
# 0s are there to "pad" sentences 1 & 2 as they all need to be the same length
and fed into an Embedding layer like this:
input = Input(shape=(5, ), dtype='int32')
embeddings = Embedding(output_dim=5, input_dim=11, input_length=5)(input)
#input dim is 11 because we want 1 more than the number of words in our vocabulary
#padding can be done with the keras function pad_sequences
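For example (a quick sketch, reusing the integer ids from above):

from keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3, 4], [3, 4, 5, 6], [7, 8, 9, 2, 10]]
padded = pad_sequences(seqs, maxlen=5, padding='post')  # pad with trailing 0s
# padded == [[1 2 3 4 0], [3 4 5 6 0], [7 8 9 2 10]]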
As I'm sure you know, the one-hot encoding of our sentences looks like this:
[[1,1,1,1,0,0,0,0,0,0],
[0,0,1,1,1,1,0,0,0,0],
[0,1,0,0,0,0,1,1,1,1]]
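For reference, here is one way (my sketch in plain NumPy, not part of the original answer) to derive these multi-hot vectors from the integer sequences:

import numpy as np

a = [[1, 2, 3, 4, 0], [3, 4, 5, 6, 0], [7, 8, 9, 2, 10]]
b = np.zeros((len(a), 10), dtype=int)
for row, seq in enumerate(a):
    for idx in seq:
        if idx != 0:             # 0 is only padding, not a word
            b[row, idx - 1] = 1  # word id 1 maps to column 0, etc.
print(b)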
Since the sentences have already been transformed (the one-hot encoding has already "embedded" our sentences as binary vectors in a 10-dimensional space), we can feed them directly to a Dense layer without any further embedding:
input = Input(shape=(vocal_size, ), dtype='float32', name='input')  # float input: the one-hot vectors go straight to Dense
output = Dense(vocal_size, use_bias=False, activation='softmax')(input)
Here's a functional toy example using both approaches:
from keras.layers import Dense,Activation,Embedding,Input,Flatten
from keras import Model
import numpy as np
words = ['Welcome to stack overflow',
'stack overflow is great',
'Hope it\'s helpful to you']
a = [[1,2,3,4,0],[3,4,5,6,0],[7,8,9,2,10]]
b = [[1,1,1,1,0,0,0,0,0,0],
[0,0,1,1,1,1,0,0,0,0],
[0,1,0,0,0,0,1,1,1,1]]
c = [1,1,0] #hypothetical target is "references stack overflow"
input = Input(shape=(5, ), dtype='int32', name='input')
embeddings = Embedding(output_dim=5, input_dim=11, input_length=5)(input)
flat = Flatten()(embeddings)
output = Dense(1, activation='sigmoid')(flat)  # sigmoid for a binary target; a 1-unit softmax would always output 1.0
model = Model(inputs=input, outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()
model.fit(np.array(a),np.array(c))
input2 = Input(shape=(10, ), dtype='float32')
output2 = Dense(1, activation='sigmoid')(input2)
model2 = Model(inputs=input2, outputs=output2)
model2.compile(optimizer='adam', loss='binary_crossentropy')
model2.summary()
model2.fit(np.array(b),np.array(c))
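One closing note (my addition): since the whole point of word2vec is the learned vectors, after training you can read them out of the Embedding layer, which is model.layers[1] in the first toy model above:

word_vectors = model.layers[1].get_weights()[0]  # the Embedding weight matrix, shape (11, 5)
print(word_vectors.shape)  # one 5-dimensional vector per word id (row 0 is the padding id)
print(word_vectors[1])     # e.g. the vector learned for word id 1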