Using Keras with the TensorFlow backend on Python 3.4.
I am trying to add custom features to my LSTM, which currently uses only word embeddings from Word2Vec.
I created an embedding matrix with Word2Vec; it serves as the embedding_layer of my LSTM.
Now my training and test datasets contain additional features that I want to incorporate into the model.
I want to merge these features into my LSTM, but I can't figure out how to reshape the data.
Here is the relevant part of the code so far:
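For context, this is roughly how such an embedding matrix is built: one row per token index, filled from the word vectors. This is a minimal sketch with a toy vocabulary and hand-made vectors standing in for the real Word2Vec output (word_index, w2v, and EMBEDDING_DIM values here are hypothetical):

```python
import numpy as np

EMBEDDING_DIM = 4
# toy stand-ins for tokenizer.word_index and a trained Word2Vec model
word_index = {'hello': 1, 'world': 2}
w2v = {'hello': np.ones(EMBEDDING_DIM), 'world': np.full(EMBEDDING_DIM, 2.0)}

nb_words = len(word_index) + 1  # row 0 is reserved for the padding index
embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if word in w2v:  # words without a vector keep the all-zeros row
        embedding_matrix[i] = w2v[word]

print(embedding_matrix.shape)  # (3, 4)
```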
labels = train['is_duplicate'].tolist()
ids = test['test_id'].tolist()
re_weight = True #for imbalance
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(q1 + q2 + q1_test + q2_test)
sequences_1 = tokenizer.texts_to_sequences(q1)
sequences_2 = tokenizer.texts_to_sequences(q2)
test_sequences_1 = tokenizer.texts_to_sequences(q1_test)
test_sequences_2 = tokenizer.texts_to_sequences(q2_test)
word_index = tokenizer.word_index
print('Found %s unique tokens' % len(word_index))
data_1 = pad_sequences(sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
data_2 = pad_sequences(sequences_2, maxlen=MAX_SEQUENCE_LENGTH)
labels = np.array(labels)
print ("Elapsed time till loading word vectors", time()-start)
test_data_1 = pad_sequences(test_sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
test_data_2 = pad_sequences(test_sequences_2, maxlen=MAX_SEQUENCE_LENGTH)
ids = np.array(ids)
data_1_train = np.vstack((data_1, data_2))
data_2_train = np.vstack((data_2, data_1))
train_features = np.vstack((other_features_train, other_features_train))
labels_train = np.concatenate((labels, labels))
nb_words = min(MAX_NB_WORDS, len(word_index))+1
nb_features = train_features.size
embedding_matrix = np.load('../data/embedding_matrix_lstm.npy')
print ("Elapsed time till creating embedding matrix", time()-start_wv)
print("Elapsed time", time()-start)
start_model = time()
embedding_layer = Embedding(nb_words,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
lstm_layer = LSTM(num_lstm, dropout=rate_drop_lstm, recurrent_dropout=rate_drop_lstm)
sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences_1 = embedding_layer(sequence_1_input)
x1 = lstm_layer(embedded_sequences_1)
sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences_2 = embedding_layer(sequence_2_input)
y1 = lstm_layer(embedded_sequences_2)
other_features = Input(shape=(nb_features, ))
merged = concatenate([x1, y1, other_features])
merged = Dropout(rate_drop_dense)(merged)
merged = BatchNormalization()(merged)
merged = Dense(num_dense, activation=act)(merged)
merged = Dropout(rate_drop_dense)(merged)
merged = BatchNormalization()(merged)
preds = Dense(1, activation='sigmoid')(merged)
model = Model(inputs=[sequence_1_input, sequence_2_input, other_features],
              outputs=preds)
model.compile(loss='binary_crossentropy',
              optimizer='nadam',
              metrics=['acc'])
hist = model.fit([data_1_train, data_2_train, other_features], labels_train,
                 validation_split=0.1,
                 epochs=200, batch_size=2048, shuffle=True,
                 class_weight=class_weight, callbacks=[early_stopping, model_checkpoint])
model.load_weights(bst_model_path)
bst_val_score = min(hist.history['val_loss'])
# Predict
preds = model.predict([test_data_1, test_data_2, other_features_test], batch_size=8192, verbose=1)
preds += model.predict([test_data_2, test_data_1, other_features_test], batch_size=8192, verbose=1)
preds /= 2
other_features_train and other_features_test are numpy arrays with dimensions (train_length, 5) and (test_length, 5).
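For reference, stacking the feature array on top of itself with np.vstack (as the code above does to mirror the doubled question pairs) simply doubles the row count while keeping the 5 feature columns. A small sketch with a hypothetical train_length of 10:

```python
import numpy as np

other_features_train = np.zeros((10, 5))  # hypothetical (train_length, 5) array
train_features = np.vstack((other_features_train, other_features_train))

print(train_features.shape)  # (20, 5)
```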
This is the error I get:
Traceback (most recent call last):
  File "lstm_customfeat.py", line 188, in <module>
    class_weight=class_weight, callbacks=[early_stopping, model_checkpoint])
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1405, in fit
    batch_size=batch_size)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1307, in _standardize_user_data
    _check_array_lengths(x, y, sample_weights)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 210, in _check_array_lengths
    set_x = set(x_lengths)
TypeError: unhashable type: 'Dimension'
I suspect the shape of the input data is the problem, since I stack data_1 and data_2 twice, in opposite orders. The rationale is that if q1 is a duplicate of q2, then q2 is also a duplicate of q1, and I don't know how else to convey this to the model.
I don't know what the 'Dimension' type is. I just want to understand how to use these features together with the word embeddings (embedding_matrix).
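One observation that may be relevant: Model.fit expects plain numpy arrays whose first (sample) dimensions all match, and the fit call above passes other_features, which is the symbolic Input layer, rather than the stacked numpy array train_features. A symbolic tensor's lengths are TensorFlow Dimension objects, which is consistent with the TypeError. A minimal numpy-only sketch of the length check Keras performs, using hypothetical shapes (20 samples, MAX_SEQUENCE_LENGTH of 30):

```python
import numpy as np

data_1_train = np.zeros((20, 30))   # hypothetical (2*train_length, MAX_SEQUENCE_LENGTH)
data_2_train = np.zeros((20, 30))
train_features = np.zeros((20, 5))  # hypothetical (2*train_length, 5)

inputs = [data_1_train, data_2_train, train_features]
# Keras checks that every input shares the same number of samples
sample_counts = {x.shape[0] for x in inputs}
print(sample_counts)  # {20}
```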