Loss not decreasing in a Keras Seq2seq bidirectional LSTM with attention

Time: 2019-10-20 19:40:27

Tags: python keras lstm bidirectional seq2seq

Can anyone see why the loss of this model is not decreasing?

After finishing Andrew Ng's Deep Learning Specialization, I tried to integrate a bidirectional LSTM with the attention model (https://www.coursera.org/learn/nlp-sequence-models/notebook/npjGi/neural-machine-translation-with-attention), but for some reason the model doesn't seem to converge.

I am running it on Google Colab.

The network takes two tensors as input, with the following shapes:

encoder_input_data[m, 262, 28]
decoder_target_data[m, 28, 28]

The output is a list of 27 one-hot vectors.

Each one-hot vector has length 28:

(26 characters of the alphabet + an end token + a start token)
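
For reference, a 28-token vocabulary like the one described could be built roughly as follows (hypothetical names; the actual encoding code is not included in the post):

import string

# hypothetical vocabulary: 26 lowercase letters plus a start and an end token = 28
tokens = list(string.ascii_lowercase) + ['<start>', '<end>']
token_index = {ch: i for i, ch in enumerate(tokens)}  # character -> one-hot index
num_tokens = len(tokens)                              # 28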

*The overall structure is:*

0) Input [262, 28] ->

1) Encoder: bidirectional LSTM ->

2) Concatenate the forward and backward outputs into encoder_outputs ->

3) Decoder LSTM + attention ->

___ * concatenate the previous decoder state s with each encoder output a(t)

___ * pass it through two dense layers and compute the alphas

___ * sum alpha(t) * a(t) over all t

4) Softmax layer and get the result
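
The code below relies on a helper returnData() whose implementation is not shown. As a stand-in, a minimal sketch that produces random one-hot data with the shapes described above could look like this (the real function presumably encodes actual character sequences):

import numpy as np

def returnData(m=1000, Tx=262, Ty=28, num_tokens=28):
    # placeholder: random one-hot sequences with the shapes from the question
    def random_one_hot(m, T, n):
        idx = np.random.randint(0, n, size=(m, T))
        return np.eye(n)[idx].astype('float32')
    encoder_input_data = random_one_hot(m, Tx, num_tokens)   # [m, 262, 28]
    decoder_input_data = random_one_hot(m, Ty, num_tokens)   # [m, 28, 28], not used for now
    decoder_target_data = random_one_hot(m, Ty, num_tokens)  # [m, 28, 28]
    return encoder_input_data, decoder_input_data, decoder_target_data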

from keras.models import Model
from keras.layers import merge, Input, LSTM, Dense, Bidirectional, concatenate, Concatenate
from keras.layers import RepeatVector, Activation, Permute, Dot, Input, Multiply
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.activations import softmax
from textwrap import wrap
import re
import random
import string
import numpy as np
import copy
from google.colab import files
from google.colab import drive

#drive.mount('/content/drive')
#files.upload()

    #returnData() creates 3 vectors:
    #encoder_input_data[m, 262, 28]
    #decoder_input_data[m, 28, 28] <- not used for now
    #decoder_target_data[m, 28, 28]

#special softmax needed for the attention layer
def softMaxAxis1(x):
    return softmax(x,axis=1)

#layers needed for the attention
repeator = RepeatVector(262)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation = "tanh")
densor2 = Dense(1, activation = "relu")
activator = Activation(softMaxAxis1, name='attention_weights')
dotor = Dot(axes = 1)

#compute one timestep of attention
#repeat s(t-1) for all the a(t) so far and concatenate them so that
#the algorithm can select the old a(t) based on current s
#let a dense layer compute the energies and a softmax decide
def one_step_attention(a, s_prev):
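    # expected shapes (with Tx=262, latent_dim=128):
    #   a:      (m, Tx, 2*latent_dim)  -- bidirectional encoder outputs
    #   s_prev: (m, 2*latent_dim)      -- previous decoder hidden state
    #   returns context: (m, 1, 2*latent_dim)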
    s_prev = repeator(s_prev)
    concat = concatenator([a, s_prev])
    e = densor1(concat)
    energies = densor2(e)
    alphas = activator(energies)
    context = dotor([alphas, a])    
    return context

#variables needed for the model
encoder_input_data, decoder_input_data, decoder_target_data = returnData()
batch_size = 64
epochs = 50
latent_dim = 128
num_samples = 1000
num_tokens = 28
Tx = 262

#encoder part with a bi-LSTM with dropout
encoder_inputs = Input(shape=(Tx, num_tokens))
encoder = Bidirectional(LSTM(latent_dim, return_sequences=True ,dropout = .7))
encoder_outputs = encoder(encoder_inputs)

#decoder part with a regular LSTM
decoder_lstm = LSTM(latent_dim*2, return_state=True)
decoder_dense = Dense(num_tokens, activation='softmax')

#initialize the parameters needed for computing attention alphas 
s0 = Input(shape=(latent_dim*2,))
c0 = Input(shape=(latent_dim*2,))
s = s0
c = c0
outputs=[]

#run attention for each target timestep Ty
for t in range(num_tokens-1):
        context = one_step_attention(encoder_outputs, s)
        s, _, c = decoder_lstm(context, initial_state = [s, c])
        out = decoder_dense(s)
        outputs.append(out)

#define the model and connect the graph
model = Model([encoder_inputs, s0, c0], outputs)

#select optimizer, loss, early_stopping
model.compile(optimizer='adam', loss='categorical_crossentropy')
keras_callbacks = [EarlyStopping(monitor='val_loss', patience=30)]

#prepare zero arrays for s0 and c0 and put the target data in the same form as the outputs
s0 = np.zeros((encoder_input_data.shape[0], latent_dim*2))
c0 = np.zeros((encoder_input_data.shape[0], latent_dim*2))
outputs = list(decoder_target_data.swapaxes(0,1))

#fit the model with the expected dimensions of input/output
model.fit(
    [encoder_input_data, s0, c0], 
    outputs,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.1
)

#save and download the model
model.save('s2s.h5')
files.download("s2s.h5")
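
A side note on the save step above: because the model uses the custom softMaxAxis1 activation, loading s2s.h5 back later would typically require passing it in via custom_objects, for example:

from keras.models import load_model

# softMaxAxis1 must also be defined/importable in the loading script
model = load_model('s2s.h5', custom_objects={'softMaxAxis1': softMaxAxis1})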

When training, I get the following results:

Train on 11671 samples, validate on 1297 samples
Epoch 1/50
11671/11671 [==============================] - 719s 62ms/step - loss: 86.7096 - dense_77_loss: 1.0157 - val_loss: 85.6579 - val_dense_77_loss: 0.5682
Epoch 2/50
11671/11671 [==============================] - 672s 58ms/step - loss: 87.6775 - dense_77_loss: 2.0322 - val_loss: 88.3077 - val_dense_77_loss: 2.4503
Epoch 3/50
11671/11671 [==============================] - 670s 57ms/step - loss: 86.1718 - dense_77_loss: 0.6686 - val_loss: 85.1008 - val_dense_77_loss: 0.1771
Epoch 4/50
11671/11671 [==============================] - 666s 57ms/step - loss: 85.1310 - dense_77_loss: 0.1196 - val_loss: 84.8357 - val_dense_77_loss: 0.0205
Epoch 5/50
11671/11671 [==============================] - 666s 57ms/step - loss: 84.7977 - dense_77_loss: 0.0173 - val_loss: 84.7414 - val_dense_77_loss: 0.0072
Epoch 6/50
11671/11671 [==============================] - 655s 56ms/step - loss: 87.8612 - dense_77_loss: 2.4636 - val_loss: 87.3005 - val_dense_77_loss: 1.3145
Epoch 7/50
11671/11671 [==============================] - 662s 57ms/step - loss: 88.1340 - dense_77_loss: 2.5091 - val_loss: 89.6831 - val_dense_77_loss: 4.6627
Epoch 8/50
11671/11671 [==============================] - 666s 57ms/step - loss: 88.2948 - dense_77_loss: 2.6113 - val_loss: 86.4465 - val_dense_77_loss: 0.1490
Epoch 9/50
11671/11671 [==============================] - 666s 57ms/step - loss: 87.3295 - dense_77_loss: 1.8405 - val_loss: 85.1743 - val_dense_77_loss: 0.1448
Epoch 10/50
11671/11671 [==============================] - 661s 57ms/step - loss: 85.0535 - dense_77_loss: 0.1180 - val_loss: 84.8204 - val_dense_77_loss: 0.0236
Epoch 11/50
11671/11671 [==============================] - 662s 57ms/step - loss: 84.7884 - dense_77_loss: 0.0179 - val_loss: 84.7479 - val_dense_77_loss: 0.0050
Epoch 12/50
11671/11671 [==============================] - 665s 57ms/step - loss: 87.0466 - dense_77_loss: 1.9977 - val_loss: 89.4181 - val_dense_77_loss: 4.4239
Epoch 13/50
 1216/11671 [==>...........................] - ETA: 9:34 - loss: 89.7864 - dense_77_loss: 4.8242

Any help would be greatly appreciated!

0 Answers:

No answers yet.