I built a Keras word2vec skip-gram model similar to this tutorial. Unfortunately, the loss keeps increasing from epoch to epoch, and I cannot find the cause.
import numpy as np

from keras.optimizers import RMSprop
from keras.models import Model
from keras.layers import Embedding, Reshape, Activation, Input
from keras.layers.merge import Dot
from keras.preprocessing.sequence import skipgrams
from keras.preprocessing import sequence

opt = RMSprop(lr=0.001)

# One embedding layer shared by the target word and the context word.
embedding = Embedding(vocab_size, emb_size, input_length=1,
                      name='embedding')

# Target word: index -> embedding -> column vector.
w_inputs = Input(shape=(1,), dtype='int32')
w = embedding(w_inputs)
w = Reshape((emb_size, 1))(w)

# Context word: same shared embedding and reshape.
c_inputs = Input(shape=(1,), dtype='int32')
c = embedding(c_inputs)
c = Reshape((emb_size, 1))(c)

# Dot product of the two vectors, squashed to a probability.
o = Dot(axes=1)([w, c])
o = Reshape((1,))(o)
o = Activation('sigmoid')(o)

model = Model(inputs=[w_inputs, c_inputs], outputs=o)
model.compile(loss='binary_crossentropy', optimizer=opt)
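To rule out a wiring problem, the model can be exercised on one dummy batch first (a minimal sketch; the random indices, labels, and the batch size of 8 are placeholders, not my actual data):

# Dummy batch: 8 random (target, context) index pairs with random 0/1
# labels. Purely illustrative; the real pairs come from skipgrams() below.
targets  = np.random.randint(1, vocab_size, size=(8,))
contexts = np.random.randint(1, vocab_size, size=(8,))
labels   = np.random.randint(0, 2, size=(8,))
print(model.train_on_batch([targets, contexts], labels))  # single-batch loss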
Since the corpus consists of independent sentences, the training procedure is structured so that the sentences are used for training one after another:
sampling_table = sequence.make_sampling_table(vocab_size)
epochs = 200
count = 0
for i in range(epochs):
    loss = 0.
    for j, doc in enumerate(mapping):
        count += 1
        # Build (target, context) pairs plus negative samples for one sentence.
        data, labels = skipgrams(sequence=doc,
                                 vocabulary_size=vocab_size,
                                 window_size=5,
                                 sampling_table=sampling_table)
        x = [np.array(x) for x in zip(*data)]
        y = np.array(labels, dtype=np.int32)
        if x:  # skipgrams() may return no pairs for a heavily subsampled sentence
            loss += model.train_on_batch(x, y)
        if count % 10000 == 0:
            print(str(count) + ' / ' + str(len(mapping)))
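For reference, skipgrams() returns a list of [target, context] index pairs together with matching 0/1 labels (1 for a real context word, 0 for a negative sample), which is why the pairs are unzipped into two input arrays above. A small illustration with made-up values:

couples, labels = skipgrams(sequence=[1, 2, 3], vocabulary_size=10, window_size=1)
print(couples)  # e.g. [[2, 1], [2, 7], [3, 2], ...]  (pair order is shuffled)
print(labels)   # e.g. [1, 0, 1, ...]                 (matching 0/1 labels)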
The variable "mapping" is a list of lists of word indices (sentences), containing about 370,000 sentences. Each sentence contains at least 2 words; however, the problem I am facing does not seem to be related to the number of words per sentence.
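For illustration, this is the structure of "mapping" (the indices here are made up):

mapping = [
    [12, 7, 305],        # sentence 1: 3 word indices
    [44, 2],             # sentence 2: the 2-word minimum
    [8, 91, 13, 5, 60],  # sentence 3: 5 word indices
]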