I have a dataset with two columns, each containing a set of documents. I have to match the documents in Col A with the documents provided in Col B. This is a supervised classification problem, so my training data contains a label column indicating whether a document pair matches.
To solve this, I created a set of features, say f1-f25 (by comparing the two documents), and then trained a binary classifier on these features. This approach works reasonably well, but now I'd like to evaluate deep learning models (specifically an LSTM model) on this problem.
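For context, the feature-based baseline looks roughly like this (a minimal sketch assuming scikit-learn; train_features and labels are hypothetical names for the f1-f25 matrix and the match labels):
from sklearn.linear_model import LogisticRegression
# train_features: (n_pairs, 25) array of the f1-f25 comparison features
# labels: 0/1 match label for each document pair
clf = LogisticRegression(max_iter=1000)
clf.fit(train_features, labels)
match_prob = clf.predict_proba(train_features)[:, 1]  # probability of a match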
I'm using the keras library in Python. After going through the keras documentation and other tutorials available online, I managed to put together the following:
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model
import keras  # needed for keras.layers.concatenate below
# Each document contains a series of 200 words
# The necessary text pre-processing steps have been completed to transform
# each doc to a fixed-length seq
main_input1 = Input(shape=(200,), dtype='int32', name='main_input1')
main_input2 = Input(shape=(200,), dtype='int32', name='main_input2')
# Next I add a word embedding layer (embed_matrix is created separately,
# with a row for each word in my vocabulary, from a pre-trained embedding model)
x = Embedding(output_dim=300, input_dim=20000,
              input_length=200, weights=[embed_matrix])(main_input1)
y = Embedding(output_dim=300, input_dim=20000,
              input_length=200, weights=[embed_matrix])(main_input2)
# Next separately pass each embedding thru an LSTM layer to transform each
# seq of vectors into a single vector
lstm_out_x1 = LSTM(32)(x)
lstm_out_x2 = LSTM(32)(y)
# concatenate the 2 layers and stack a dense layer on top
x = keras.layers.concatenate([lstm_out_x1, lstm_out_x2])
x = Dense(64, activation='relu')(x)
# generate intermediate output
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(x)
# add auxiliary input - auxiliary inputs contains 25 features for each document pair
auxiliary_input = Input(shape=(25,), name='aux_input')
# merge aux output with aux input and stack dense layer on top
main_input = keras.layers.concatenate([auxiliary_output, auxiliary_input])
x = Dense(64, activation='relu')(main_input)
x = Dense(64, activation='relu')(x)
# finally add the main output layer
main_output = Dense(1, activation='sigmoid', name='main_output')(x)
model = Model(inputs=[main_input1, main_input2, auxiliary_input], outputs=main_output)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit([x1, x2, aux_input], y,
          epochs=3, batch_size=32)
However, when I score on the training data, I get the same probability score for all cases. The issue seems to be with the way the auxiliary input is fed in (because it generates meaningful output when I remove the auxiliary input). I also tried inserting the auxiliary input at different places in the network, but somehow I couldn't get this to work.
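For reference, this is roughly how I score (a sketch; the predict call mirrors the fit call above):
import numpy as np
preds = model.predict([x1, x2, aux_input])
print(np.unique(np.round(preds, 3)))  # prints a single value, i.e. the same score for every pair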
Any pointers?
Answer 0 (score: 0)
Well, this has been open for a few months and people are voting it up. I recently did something very similar using this dataset, which can be used to predict credit card defaults; it contains categorical data about customers (gender, education level, marital status, etc.) as well as payment history as a time series, so I had to merge a time series with non-series data. My solution was very similar to yours, combining an LSTM with dense layers, and I've tried to adapt that approach to your problem. What worked for me was a dense branch on the auxiliary input.
Also, in your case shared layers make sense, so that the same weights are used to "read" both documents. My proposal, for testing on your data:
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model
import keras  # needed for keras.layers.concatenate below
# Each document contains a series of 200 words
# The necessary text pre-processing steps have been completed to transform
# each doc to a fixed-length seq
main_input1 = Input(shape=(200,), dtype='int32', name='main_input1')
main_input2 = Input(shape=(200,), dtype='int32', name='main_input2')
# Next I add a word embedding layer (embed_matrix is created separately,
# with a row for each word in my vocabulary, from a pre-trained embedding model)
x1 = Embedding(output_dim=300, input_dim=20000,
               input_length=200, weights=[embed_matrix])(main_input1)
x2 = Embedding(output_dim=300, input_dim=20000,
               input_length=200, weights=[embed_matrix])(main_input2)
# Next separately pass each embedding thru an LSTM layer to transform each
# seq of vectors into a single vector
# Comment Manngo: Here I changed to shared layer
# Also renamed the embedding outputs from x and y to x1 and x2,
# since y usually denotes the target and was confusing
lstm_reader = LSTM(32)
lstm_out_x1 = lstm_reader(x1)
lstm_out_x2 = lstm_reader(x2)
# concatenate the 2 layers and stack a dense layer on top
x = keras.layers.concatenate([lstm_out_x1, lstm_out_x2])
x = Dense(64, activation='relu')(x)
x = Dense(32, activation='relu')(x)
# generate intermediate output
# Comment Manngo: This is created as a dead-end
# It will not be used as an input of any layers below
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(x)
# add auxiliary input - auxiliary inputs contains 25 features for each document pair
# Comment Manngo: Dense branch on the comparison features
auxiliary_input = Input(shape=(25,), name='aux_input')
# keep the Input handle intact (it is needed in Model(inputs=...) below);
# run the features through the dense branch under a different name
aux_branch = Dense(64, activation='relu')(auxiliary_input)
aux_branch = Dense(32, activation='relu')(aux_branch)
# OLD: merge aux output with aux input and stack dense layer on top
# Comment Manngo: actually this is merging the aux output preparation dense with the aux input processing dense
main_input = keras.layers.concatenate([x, aux_branch])
main = Dense(64, activation='relu')(main_input)
main = Dense(64, activation='relu')(main)
# finally add the main output layer
main_output = Dense(1, activation='sigmoid', name='main_output')(main)
# Compile
# Comment Manngo: also define weighting of outputs, main as 1, auxiliary as 0.5
# (the weighting is applied via loss_weights; both heads use plain binary_crossentropy)
model = Model(inputs=[main_input1, main_input2, auxiliary_input],
              outputs=[main_output, auxiliary_output])
model.compile(optimizer='adam',
              loss={'main_output': 'binary_crossentropy', 'aux_output': 'binary_crossentropy'},
              loss_weights={'main_output': 1., 'aux_output': 0.5},
              metrics=['accuracy'])
# Train model on main_output and on auxiliary_output as a support
# Comment Manngo: Unknown information marked with placeholders ____
# We have 3 inputs: x1 and x2: the 2 strings
# aux_in: the 25 features
# We have 2 outputs: main and auxiliary; both have the same targets -> (binary)y
model.fit({'main_input1': __x1__, 'main_input2': __x2__, 'aux_input': __aux_in__},
          {'main_output': __y__, 'aux_output': __y__},
          epochs=1000,
          batch_size=__,
          validation_split=0.1,
          callbacks=[____])
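At prediction time you would read only the main output; the auxiliary output is just a training aid (a sketch using the same placeholders):
# predict returns one array per model output, in the order [main_output, aux_output]
main_pred, aux_pred = model.predict({'main_input1': __x1__, 'main_input2': __x2__, 'aux_input': __aux_in__})
# main_pred holds the match probability for each document pair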
I don't know how much this helps, since I don't have your data and couldn't try it, but this is my best shot. For obvious reasons I didn't run the code above.
Answer 1 (score: 0)
I found the answer at https://datascience.stackexchange.com/questions/17099/adding-features-to-time-series-model-lstm. Philippe Rémy wrote a library for conditioning an RNN on auxiliary inputs. I used his library and it was very helpful.
# 10 stations
# 365 days
# 3 continuous variables: A, B and C, where C is the target.
# 2 conditions dim=5 and dim=1. First cond is one-hot. Second is continuous.
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from cond_rnn import ConditionalRNN
stations = 10 # 10 stations.
time_steps = 365 # 365 days.
continuous_variables_per_station = 3 # A,B,C where C is the target.
condition_variables_per_station = 2 # 2 variables of dim 5 and 1.
condition_dim_1 = 5
condition_dim_2 = 1
np.random.seed(123)
continuous_data = np.random.uniform(size=(stations, time_steps, continuous_variables_per_station))
condition_data_1 = np.zeros(shape=(stations, condition_dim_1))
condition_data_1[:, 0] = 1 # dummy.
condition_data_2 = np.random.uniform(size=(stations, condition_dim_2))
window = 50 # we split series in 50 days (look-back window)
x, y, c1, c2 = [], [], [], []
for i in range(window, continuous_data.shape[1]):
    x.append(continuous_data[:, i - window:i])
    y.append(continuous_data[:, i])
    c1.append(condition_data_1)  # just replicate.
    c2.append(condition_data_2)  # just replicate.
# now we have (batch_dim, station_dim, time_steps, input_dim).
x = np.array(x)
y = np.array(y)
c1 = np.array(c1)
c2 = np.array(c2)
print(x.shape, y.shape, c1.shape, c2.shape)
# let's collapse the station_dim in the batch_dim.
x = np.reshape(x, [-1, window, x.shape[-1]])
y = np.reshape(y, [-1, y.shape[-1]])
c1 = np.reshape(c1, [-1, c1.shape[-1]])
c2 = np.reshape(c2, [-1, c2.shape[-1]])
print(x.shape, y.shape, c1.shape, c2.shape)
model = Sequential(layers=[
    ConditionalRNN(10, cell='GRU'),  # num_cells = 10
    Dense(units=1, activation='linear')  # regression problem.
])
model.compile(optimizer='adam', loss='mse')
model.fit(x=[x, c1, c2], y=y, epochs=2, validation_split=0.2)
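For what it's worth, the core idea behind the library (conditioning the recurrent state on non-sequence features) can also be sketched in plain Keras by initialising the LSTM state from the auxiliary features. This is a hedged sketch with hypothetical names (x1_embedded stands for the embedded 200-step document sequence from the question), not the library's actual implementation:
from keras.layers import Input, Dense, LSTM
aux_input = Input(shape=(25,), name='aux_input')
# map the 25 comparison features to the LSTM's initial hidden and cell states
init_h = Dense(32, activation='tanh')(aux_input)
init_c = Dense(32, activation='tanh')(aux_input)
lstm_out = LSTM(32)(x1_embedded, initial_state=[init_h, init_c])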