Question

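I am trying to save my model after every tenth epoch and after the final epoch, but at every saved epoch I get TensorFlow messages. When I try to load the saved model, I get an error. This only happens when I train with multiple GPUs. When I train on a single GPU, I do not get the TensorFlow messages at each saved epoch. I also noticed that, next to the progress bar, the multi-GPU model shows 8/7, and that the messages appear before the epoch has finished.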

Does this mean the model is being saved before the epoch is complete? If so, what would cause the model to be saved prematurely?

Saving code:

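# Assumed import (the log paths point to tf.keras); observe_var, model, train_x and train_y are defined elsewhere
from tensorflow.keras.callbacks import ModelCheckpoint

model_checkpoint = ModelCheckpoint('unet_{epoch:04}.model', monitor=observe_var, save_best_only=False, period=10)
model.fit(train_x, train_y, batch_size=2, epochs=600, verbose=1, shuffle=True, validation_split=.2, callbacks=[model_checkpoint])
model.save('unet_final.model')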
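
To make the intended save schedule concrete, here is a minimal sketch in plain Python of which checkpoint names the template and period above would be expected to produce. It assumes the callback fills in the 1-based epoch number shown in the progress output (consistent with `unet_0010.model` being written at "Epoch 10" in the log below). The helper names `checkpoint_name` and `saved_epochs` are hypothetical and are not part of Keras.

```python
# Hypothetical helpers for illustration; only the filename template,
# the period (10) and the epoch count (600) come from the question.
TEMPLATE = 'unet_{epoch:04}.model'
PERIOD = 10
EPOCHS = 600


def checkpoint_name(epoch):
    """Expand the filename template for a 1-based epoch number."""
    return TEMPLATE.format(epoch=epoch)


def saved_epochs(epochs=EPOCHS, period=PERIOD):
    """Return the 1-based epochs at which a save every `period` epochs would occur."""
    return [e for e in range(1, epochs + 1) if e % period == 0]


names = [checkpoint_name(e) for e in saved_epochs()]
print(names[0], names[-1], len(names))
```

Under that assumption there would be 60 periodic checkpoints, from `unet_0010.model` to `unet_0600.model`, in addition to the separately saved `unet_final.model`.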

Jupyter notebook output with multiple GPUs:

Epoch 10/600
6/7 [========================>.....] - ETA: 1s - loss: -0.1712 - dice_coef: 0.1712WARNING:tensorflow:From /home/diablo-redhat/anaconda3/envs/gputest/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: models/unet_0010.model/assets
8/7 [==================================] - 15s 2s/sample - loss: -0.1670 - dice_coef: 0.1670 - val_loss: -0.1628 - val_dice_coef: 0.1628

Jupyter notebook output with a single GPU:

Epoch 10/600
7/7 [==================================] - 13s 2s/sample - loss: -0.1554 - dice_coef: 0.1554 - val_loss: -0.1661 - val_dice_coef: 0.1661

Error when loading the model:

FailedPreconditionError: Error while reading resource variable conv3d/kernel_23175 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/conv3d/kernel_23175/class tensorflow::Var does not exist. [Op:ReadVariableOp]

ML model not saving correctly?

0 Answers: