Adding custom features to an LSTM in Keras

Time: 2017-05-08 21:18:03

Tags: python numpy tensorflow keras lstm

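I am using Keras with the TensorFlow backend on Python 3.4.

I am trying to add custom features to my LSTM, which currently uses only word embeddings from Word2Vec.

I created an embedding matrix with Word2Vec. It is used as the embedding_layer of my LSTM.

My training and test datasets also contain features that I want to incorporate into the model.

I want to merge these features into my LSTM, but I cannot figure out how to reshape the data.

Here is the relevant part of the code so far: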

labels = train['is_duplicate'].tolist()
ids = test['test_id'].tolist()

re_weight = True #for imbalance

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(q1 + q2 + q1_test + q2_test)

sequences_1 = tokenizer.texts_to_sequences(q1)
sequences_2 = tokenizer.texts_to_sequences(q2)

test_sequences_1 = tokenizer.texts_to_sequences(q1_test)
test_sequences_2 = tokenizer.texts_to_sequences(q2_test)

word_index = tokenizer.word_index
print('Found %s unique tokens' % len(word_index))
data_1 = pad_sequences(sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
data_2 = pad_sequences(sequences_2, maxlen=MAX_SEQUENCE_LENGTH)
labels = np.array(labels)

print ("Elapsed time till loading word vectors", time()-start)

test_data_1 = pad_sequences(test_sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
test_data_2 = pad_sequences(test_sequences_2, maxlen=MAX_SEQUENCE_LENGTH)
ids = np.array(ids)

data_1_train = np.vstack((data_1, data_2))
data_2_train = np.vstack((data_2, data_1))
train_features = np.vstack((other_features_train, other_features_train))
labels_train = np.concatenate((labels, labels))

nb_words = min(MAX_NB_WORDS, len(word_index))+1

nb_features = train_features.size

embedding_matrix = np.load('../data/embedding_matrix_lstm.npy')

print ("Elapsed time till creating embedding matrix", time()-start_wv)

print("Elapsed time", time()-start)

start_model = time()

embedding_layer = Embedding(nb_words,
        EMBEDDING_DIM,
        weights=[embedding_matrix],
        input_length=MAX_SEQUENCE_LENGTH,
        trainable=False)
lstm_layer = LSTM(num_lstm, dropout=rate_drop_lstm, recurrent_dropout=rate_drop_lstm)

sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences_1 = embedding_layer(sequence_1_input)
x1 = lstm_layer(embedded_sequences_1)

sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences_2 = embedding_layer(sequence_2_input)
y1 = lstm_layer(embedded_sequences_2)

other_features = Input(shape=(nb_features, ))

merged = concatenate([x1, y1, other_features])
merged = Dropout(rate_drop_dense)(merged)
merged = BatchNormalization()(merged)

merged = Dense(num_dense, activation=act)(merged)
merged = Dropout(rate_drop_dense)(merged)
merged = BatchNormalization()(merged)

preds = Dense(1, activation='sigmoid')(merged)


model = Model(inputs=[sequence_1_input, sequence_2_input, other_features],
        outputs=preds)
model.compile(loss='binary_crossentropy',
        optimizer='nadam',
        metrics=['acc'])

hist = model.fit([data_1_train, data_2_train, other_features], labels_train,
        validation_split=0.1,
        epochs=200, batch_size=2048, shuffle=True,
        class_weight=class_weight, callbacks=[early_stopping, model_checkpoint])

model.load_weights(bst_model_path)
bst_val_score = min(hist.history['val_loss'])

# Predict

preds = model.predict([test_data_1, test_data_2, other_features_test], batch_size=8192, verbose=1)
preds += model.predict([test_data_2, test_data_1, other_features_test], batch_size=8192, verbose=1)
preds /= 2

other_features_train and other_features_test are numpy arrays of shape (train_length, 5) and (test_length, 5).

This is the error I get:

Traceback (most recent call last):
  File "lstm_customfeat.py", line 188, in <module>
    class_weight=class_weight, callbacks=[early_stopping, model_checkpoint])
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1405, in fit
    batch_size=batch_size)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1307, in _standardize_user_data
    _check_array_lengths(x, y, sample_weights)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 210, in _check_array_lengths
    set_x = set(x_lengths)
TypeError: unhashable type: 'Dimension'

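I suspect the shape of the input data is the problem, because I stack data_1 and data_2 twice, in different orders. The rationale is that if q1 is a duplicate of q2, then q2 is also a duplicate of q1. I don't know how to indicate this to the model.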
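
The symmetric stacking described here can be checked on toy arrays. The following is a minimal sketch, assuming small made-up arrays in place of the real data; only the variable names are taken from the code in the question. It also assumes the extra features are the same for (q1, q2) and (q2, q1), which is why they are simply repeated. As a side observation, the per-sample width that Input(shape=(nb_features,)) would expect is the second entry of the array's shape, whereas .size counts every element in the array.

```python
import numpy as np

# Toy stand-ins for the real data (values are made up for illustration).
train_length, max_len, n_feat = 3, 4, 5
data_1 = np.arange(train_length * max_len).reshape(train_length, max_len)
data_2 = data_1 + 100
other_features_train = np.ones((train_length, n_feat))
labels = np.array([0, 1, 0])

# Present each pair in both orders: (q1, q2) and (q2, q1).
data_1_train = np.vstack((data_1, data_2))
data_2_train = np.vstack((data_2, data_1))
# Repeat the extra features and labels so every row stays aligned.
train_features = np.vstack((other_features_train, other_features_train))
labels_train = np.concatenate((labels, labels))

# Every array passed to fit() needs the same number of rows.
n_rows = {a.shape[0] for a in (data_1_train, data_2_train,
                               train_features, labels_train)}

# Per-sample feature width versus total element count.
nb_features = train_features.shape[1]
total_elements = train_features.size  # rows * columns, not the width

print(n_rows, nb_features, total_elements)
```

If the row counts agree as they do in this sketch, the stacking itself is unlikely to be what triggers the error.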

I don't know what the 'Dimension' type is. I just want to understand how to use these features together with the word embeddings (embedding_matrix).
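
On the 'Dimension' type: in TensorFlow 1.x, the entries of a tensor's static shape are Dimension objects rather than plain integers. The traceback fails at set(x_lengths), which is consistent with one of the inputs given to fit being a symbolic tensor instead of a numpy array. In the code above, fit receives other_features (the Input layer) where train_features was probably intended. The sketch below does not use TensorFlow or Keras; FakeDimension and check_lengths are hypothetical stand-ins that reproduce the same kind of TypeError in plain Python.

```python
class FakeDimension:
    """Hypothetical stand-in: defines __eq__ but not __hash__, so it is unhashable."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return self.value == getattr(other, "value", other)


def check_lengths(lengths):
    """Mimic the failing line in the traceback: put every input length in a set."""
    return set(lengths)


# Numpy arrays report plain ints as lengths, so the check succeeds.
ok = check_lengths([6, 6, 6])

# A length that is an unhashable object breaks the same check.
try:
    check_lengths([6, 6, FakeDimension(None)])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(ok, error_message)
```

Under that reading, passing the stacked numpy array (train_features) as the third input to fit, rather than the Input tensor, would be the first thing to try.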

0 answers:

No answers yet.