I'm trying to build an encoder-decoder architecture in Keras to train a (many-to-many) sequence model.
What I'm looking for is a model that takes some variable-length text, e.g. a news article, and outputs a summary of that text (also variable-length).
My model looks like this:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense, Activation

model = Sequential()
# Encoder: embed the padded word indices, then compress the whole input into a single state vector
model.add(Embedding(NUM_WORDS + 1, EMBEDDING_SIZE, input_length=MAX_SEQUENCE_LENGTH, mask_zero=True))
model.add(LSTM(HIDDEN_SIZE))
model.add(RepeatVector(MAX_SEQUENCE_LENGTH))
# Decoder: stacked LSTMs returning full sequences, followed by a per-timestep softmax
for _ in range(NUM_LSTM_LAYERS):
    model.add(LSTM(HIDDEN_SIZE, return_sequences=True))
model.add(TimeDistributed(Dense(MAX_SEQUENCE_LENGTH)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.fit(X_train, Y_train, epochs=NUM_EPOCHS)
I believe this is correct. I'm preprocessing the input data as follows:
import csv
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Read the source texts and their target summaries from a semicolon-delimited CSV
with open('data.csv', 'r') as csvfile:
    dataset = csv.DictReader(csvfile, delimiter=';')
    texts = []
    X = []
    Y = []
    for row in dataset:
        texts.append(row['data'])
        texts.append(row['label'])
        X.append(row['data'])
        Y.append(row['label'])

# Fit a single tokenizer on inputs and targets so both share one vocabulary
tokenizer = Tokenizer(num_words=NUM_WORDS, lower=True, split=" ")
tokenizer.fit_on_texts(texts)
X_sequences = tokenizer.texts_to_sequences(X)
Y_sequences = tokenizer.texts_to_sequences(Y)
# Left-pad every sequence to the same fixed length
X_train = pad_sequences(X_sequences, maxlen=MAX_SEQUENCE_LENGTH)
Y_train = pad_sequences(Y_sequences, maxlen=MAX_SEQUENCE_LENGTH)
In other words, a sequence like "The quick brown fox jumps over the lazy dog" is converted into a sequence of numbers [1, 2, 3, 4, 5, 6, 1, 7, 8] and then padded: [0, 0, 0, ..., 0, 1, 2, 3, 4, 5, 6, 1, 7, 8]. The target data (Y_train) is transformed the same way.
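For reference, here is a minimal sketch of that tokenize-and-pad step in isolation (the maxlen of 15 is just an assumed toy value so the printed output stays short; the real script uses MAX_SEQUENCE_LENGTH):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Toy illustration only -- the actual word indices depend on the fitted vocabulary
toy = Tokenizer(lower=True, split=" ")
toy.fit_on_texts(["The quick brown fox jumps over the lazy dog"])
seq = toy.texts_to_sequences(["The quick brown fox jumps over the lazy dog"])
print(seq)                              # [[1, 2, 3, 4, 5, 6, 1, 7, 8]]
padded = pad_sequences(seq, maxlen=15)  # default padding='pre', i.e. zeros on the left
print(padded)                           # [[0 0 0 0 0 0 1 2 3 4 5 6 1 7 8]]
print(padded.shape)                     # (1, 15) -> 2D: one integer per timestep, per sample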
When I try to train the model, I get the following error: ValueError: Error when checking target: expected activation_1 to have 3 dimensions, but got array with shape (3, 200). I suspect this is because TimeDistributed(Dense(MAX_SEQUENCE_LENGTH)) produces a 3D output, but I'm not sure how to format the Y_train data to fix the problem.
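To make the mismatch concrete, this is my understanding of the shapes involved (the numbers assume MAX_SEQUENCE_LENGTH = 200 and a batch of 3 samples, matching the error message; they are not taken from the real data):

# The final softmax predicts MAX_SEQUENCE_LENGTH values at every one of the
# MAX_SEQUENCE_LENGTH timesteps, so the model expects a 3D target.
print(model.output_shape)  # (None, 200, 200) with these settings
# pad_sequences produces only one integer index per timestep, so the target is 2D.
print(Y_train.shape)       # (3, 200) -> triggers the ValueError above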