I trained a text-classification model containing an RNN in TensorFlow 2.0 using the Keras API. I trained it on multiple GPUs (2) using tf.distribute.MirroredStrategy(), following the guide here. After each epoch I saved a checkpoint of the model with tf.keras.callbacks.ModelCheckpoint('file_name.h5').
Now I want to resume training from the last saved checkpoint, using the same number of GPUs. When I load the checkpoint inside the tf.distribute.MirroredStrategy() scope like this:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.models.load_model('file_name.h5')
it raises the following error:
File "model_with_tfsplit.py", line 94, in <module>
model =tf.keras.models.load_model('TF_model_onfull_2_03.h5') # Loading for retraining
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/saving/save.py", line 138, in load_model
return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 187, in load_model_from_hdf5
model._make_train_function()
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 2015, in _make_train_function
params=self._collected_trainable_weights, loss=self.total_loss)
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 500, in get_updates
grads = self.get_gradients(loss, params)
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 391, in get_gradients
grads = gradients.gradients(loss, params)
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/ops/gradients_impl.py", line 158, in gradients
unconnected_gradients)
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/ops/gradients_util.py", line 541, in _GradientsHelper
for x in xs
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/distribute/values.py", line 716, in handle
raise ValueError("`handle` is not available outside the replica context"
ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call
Now I am not sure where the problem is. Also, if I do not use the mirrored strategy (no multiple GPUs), training starts from scratch, but after a few steps it reaches the same accuracy and loss values it had before the model was saved. I am not sure whether that behaviour is normal either.
Thanks! Rishabh Sahrawat
Answer 0 (score: 1)
I solved it similarly to @Srihari Humbarwadi's answer, with one difference: the strategy scope is moved inside the get_model function. TF's docs describe it in a similar way:
def get_model(strategy):
    with strategy.scope():
        ...
        return model
and call it before training like this:
strategy = tf.distribute.MirroredStrategy()
model = get_model(strategy)
model.load_weights('file_name.h5')
Unfortunately, calling
model = tf.keras.models.load_model('file_name.h5')
does not enable multi-GPU training. My guess is that this is related to the .h5 model format; it might work with TensorFlow's native SavedModel (.pb) format.
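A minimal sketch of that idea, assuming a toy stand-in for the real RNN classifier (build_model and the directory name are hypothetical): export in the native SavedModel format instead of .h5, then reload inside a fresh strategy scope.

```python
# Sketch only: save/restore via the native SavedModel format instead of .h5.
# build_model is a hypothetical stand-in for the real RNN text classifier.
import tensorflow as tf

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(4, activation='relu', input_shape=(8,)),
        tf.keras.layers.Dense(2, activation='softmax'),
    ])

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model()
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Export in the native format (a directory, not a single .h5 file).
model.save('saved_model_dir', save_format='tf')

# Later: reload inside a fresh strategy scope to resume training.
with tf.distribute.MirroredStrategy().scope():
    restored = tf.keras.models.load_model('saved_model_dir')
```

Whether this avoids the `handle` error in every TF 2.0 point release is untested here; it is the direction the format guess above points to.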
Answer 1 (score: 0)
Create the model under the distribution scope and then load the weights with the load_weights method. In this example, get_model returns an instance of tf.keras.Model:
def get_model():
    ...
    return model

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = get_model()
    model.load_weights('file_name.h5')
    model.compile(...)
    model.fit(...)
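On the saving side, one way to make this pattern work end to end (a sketch with a toy model and random data; the file name and model are assumptions) is to checkpoint only the weights, which load_weights can then restore directly:

```python
# Sketch only: checkpoint weights (not the full model), so they can be
# restored with load_weights inside a new MirroredStrategy scope.
import numpy as np
import tensorflow as tf

def get_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(4, activation='relu', input_shape=(8,)),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

x = np.random.rand(16, 8).astype('float32')
y = np.random.randint(0, 2, size=(16, 1))

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = get_model()
ckpt = tf.keras.callbacks.ModelCheckpoint('weights.h5', save_weights_only=True)
model.fit(x, y, epochs=1, batch_size=8, verbose=0, callbacks=[ckpt])

# Resume: rebuild the model under a fresh scope, then restore the weights.
with tf.distribute.MirroredStrategy().scope():
    resumed = get_model()
    resumed.load_weights('weights.h5')
```

Note that load_weights restores only the model weights; the optimizer state is rebuilt from scratch, which is consistent with the behaviour described in the question where accuracy recovers after a few steps rather than continuing seamlessly.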