Keras: resuming fit_generator training of an LSTM model loses the last best loss value

Asked: 2019-05-08 21:15:52

Tags: tensorflow keras lstm resume

Can anyone help me find a way to resume training an LSTM model with fit_generator without the recorded best loss value being reset to inf?

Background: I am training an LSTM model on a very long time series (many sampled time steps) with only 2 features, so my time-series x data has shape N×2, where N is a very large number. I use a batch generator to randomly cut the data into mini-batches of length batch_N (where batch_N is much smaller than N):

def batch_generator(batch_size, sequence_length):
    ...
    for i in range(batch_size):
        ...
        x_batch[i] = batch_x_train_scaled[idx:idx+sequence_length]
        y_batch[i] = batch_y_train_scaled[idx:idx+sequence_length]
    yield (x_batch, y_batch)

I also use a ModelCheckpoint callback to save the best model during training:

from tensorflow.keras.callbacks import ModelCheckpoint

callback_checkpoint = ModelCheckpoint(filepath=path_checkpoint,
                                      monitor='val_loss', verbose=1,
                                      save_weights_only=False,
                                      save_best_only=True)
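
For context, this is roughly how the generator and the callback plug into fit_generator (a minimal sketch: the batch_size, sequence_length, epochs, and steps_per_epoch values and the x_validation / y_validation names are placeholders, not the original settings):

generator = batch_generator(batch_size=256, sequence_length=128)

model.fit_generator(generator=generator,
                    epochs=20,
                    steps_per_epoch=100,
                    validation_data=(x_validation, y_validation),
                    callbacks=[callback_checkpoint])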

In addition, every time I want to resume training, I first load the last saved model:

from tensorflow.keras.models import load_model

if True:
    try:
#         model.load_weights(path_checkpoint)
        model = load_model(path_checkpoint)

    except Exception as error:
        print("Error trying to load checkpoint.")
        print(error)

So where is the problem? Every time I resume training, fit_generator is fed freshly generated batches and the model does start from the last saved weights, but ModelCheckpoint's record of the best val_loss is reset to inf. As a result, at the end of the first resumed epoch the callback reports that val_loss improved from inf to some number, no matter how good or bad the result actually is, and overwrites the checkpoint with the new weights. Since the model is now trained on different random batches, the new weights are sometimes worse than the previous best, so I lose the best weights.
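
For reference, ModelCheckpoint keeps this running best in a `best` attribute on the callback object, initialized to np.Inf when the monitored metric is minimized. A minimal workaround sketch, assuming the best val_loss of the previous run is known (the 0.0123 below is a placeholder, not a real value):

# ModelCheckpoint initializes self.best to np.Inf for mode='min', which is
# why the first resumed epoch always "improves" and overwrites the file.
callback_checkpoint.best = 0.0123  # placeholder: best val_loss of the previous run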

What have I tried so far to fix this?

Approach one (unsuccessful): define a custom loss function:

import numpy as np
import tensorflow as tf
from tensorflow.keras.losses import binary_crossentropy

def my_loss(y_true, y_pred):
    train_loss = binary_crossentropy(y_true, y_pred)
    validation_loss = 2 * binary_crossentropy(y_true, y_pred)
    temp = tf.keras.backend.cast(validation_loss, 'float16')
    if temp > 1:  # update 1 to the last best val_loss before resuming
        validation_loss = validation_loss + np.inf
        # validation_loss = np.inf
    return tf.keras.backend.in_train_phase(train_loss, validation_loss)

model.compile(loss=my_loss, optimizer=optimizer)

Result of approach one:

Error:
---> 12     if temp>1:
TypeError: Using a `tf.Tensor` as a Python `bool` is not allowed. Use `if t is not None:` instead of `if t:` to test if a tensor is defined, and use TensorFlow ops such as tf.cond to execute subgraphs conditioned on the value of a tensor.
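
The TypeError arises because `temp > 1` is a symbolic tensor, which cannot drive a Python `if`. The comparison itself could be expressed with a graph-level conditional such as K.switch; the following is only a sketch of that adaptation (it avoids the error but still does not restore the checkpoint's best value):

import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.losses import binary_crossentropy

def my_loss(y_true, y_pred):
    train_loss = binary_crossentropy(y_true, y_pred)
    validation_loss = 2 * binary_crossentropy(y_true, y_pred)
    # K.switch evaluates the condition inside the graph instead of in Python;
    # a large finite penalty stands in for np.inf, which would poison gradients.
    validation_loss = K.switch(validation_loss > 1,
                               validation_loss + 1e9,
                               validation_loss)
    return K.in_train_phase(train_loss, validation_loss)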

Approach two (unsuccessful): define a custom callback to save the model:

from tensorflow.keras.callbacks import LambdaCallback

best_val_loss = 1  # update 1 to the last best val_loss before resuming

def saveModel(epoch, logs):
    val_loss = logs['val_loss']
    if val_loss < best_val_loss:
        best_val_loss = val_loss  # this assignment makes best_val_loss a local name
        model.save('my_model.hdf5')

my_callback = LambdaCallback(on_epoch_end=saveModel)

Result of approach two:

UnboundLocalError: local variable 'best_val_loss' referenced before assignment
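
The UnboundLocalError happens because assigning to best_val_loss inside saveModel turns it into a local name, so the comparison reads an unassigned local. A sketch of approach two with only the scoping fixed via a global declaration:

best_val_loss = 1  # update 1 to the last best val_loss before resuming

def saveModel(epoch, logs):
    global best_val_loss  # write to the module-level variable, not a new local
    val_loss = logs['val_loss']
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        model.save('my_model.hdf5')

my_callback = LambdaCallback(on_epoch_end=saveModel)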

Approach three (unsuccessful): define a custom callback to save the model:

best_val_loss = 1  # update 1 to the last best val_loss before resuming

def saveModel(epoch, logs, best_val_loss):
    val_loss = logs['val_loss']
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        model.save('my_model.hdf5')

my_callback = LambdaCallback(on_epoch_end=saveModel)

Result of approach three:

TypeError: saveModel() missing 1 required positional argument: 'best_val_loss'
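
This fails because LambdaCallback invokes on_epoch_end with exactly two arguments, (epoch, logs), so the extra best_val_loss parameter is never supplied. One possible sketch binds the state through a mutable dict with functools.partial, so that updates also persist between epochs:

from functools import partial
from tensorflow.keras.callbacks import LambdaCallback

state = {'best_val_loss': 1}  # update 1 to the last best val_loss before resuming

def saveModel(epoch, logs, state):
    val_loss = logs['val_loss']
    if val_loss < state['best_val_loss']:
        state['best_val_loss'] = val_loss  # mutating the dict persists across calls
        model.save('my_model.hdf5')

my_callback = LambdaCallback(on_epoch_end=partial(saveModel, state=state))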

0 Answers