Restoring a TensorFlow model on a different machine

Time: 2018-06-07 23:35:09

Tags: python tensorflow

I trained a TensorFlow model on a GPU cluster and saved it using:
saver = tf.train.Saver()
saver.save(sess, config.model_file, global_step=global_step)

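Now I am trying to restore the model for evaluation on a different system, using:

saver = tf.train.import_meta_graph('model-1000.meta')
saver.restore(sess, tf.train.latest_checkpoint(save_path))

The problem is that the saver.restore line fails with the following error, which the traceback shows is raised inside tf.train.latest_checkpoint: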

Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/jonpdeaton/Developer/BraTS18-Project/segmentation/evaluate.py", line 205, in <module>
    main()
  File "/Users/jonpdeaton/Developer/BraTS18-Project/segmentation/evaluate.py", line 162, in main
    restore_and_evaluate(save_path, model_file, output_dir)
  File "/Users/jonpdeaton/Developer/BraTS18-Project/segmentation/evaluate.py", line 127, in restore_and_evaluate
    saver.restore(sess, tf.train.latest_checkpoint(save_path))
  File "/Users/jonpdeaton/anaconda3/envs/BraTS/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1857, in latest_checkpoint
    if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
  File "/Users/jonpdeaton/anaconda3/envs/BraTS/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 337, in get_matching_files
    for single_filename in filename
  File "/Users/jonpdeaton/anaconda3/envs/BraTS/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: /afs/cs.stanford.edu/u/jdeaton/dfs/unet; No such file or directory

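It seems that a path from the system the model was trained on is stored in the model or checkpoint files, and that path is no longer valid on the system I am evaluating on. How do I restore the model on a different machine (for evaluation) after copying the model-X.meta, model-X.index, and checkpoint files?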

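2 answers:

Answer 0 (score: 1)

By default, a Saver object writes absolute model checkpoint paths into the checkpoint file. As a result, the path returned by tf.train.latest_checkpoint(save_path) is an absolute path on the old machine.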

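Temporary solutions:

  1. Pass the path of the actual model files (the checkpoint prefix in the local directory, for example the one ending in model-1000) directly to the restore method, instead of the result of tf.train.latest_checkpoint.
  2. Manually edit the checkpoint file, which is a simple text file.

Long-term solution: create the Saver with save_relative_paths=True when training, so that the checkpoint file records paths relative to its own directory: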
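
As a rough illustration of the first workaround, the sketch below reads the checkpoint text file, keeps only the file name of the recorded prefix, and joins it with the local directory. It assumes the usual layout of that file, with a line of the form model_checkpoint_path: "<prefix>". The helper name local_checkpoint_prefix is hypothetical and is not part of TensorFlow.

```python
import os
import re


def local_checkpoint_prefix(save_dir):
    """Return the newest checkpoint prefix, re-rooted in save_dir.

    Hypothetical helper (not a TensorFlow API). Assumes the `checkpoint`
    file contains a line such as:
        model_checkpoint_path: "/path/on/training/machine/model-1000"
    """
    with open(os.path.join(save_dir, "checkpoint")) as f:
        text = f.read()
    # Anchor at line start so "all_model_checkpoint_paths" lines are skipped.
    match = re.search(r'^model_checkpoint_path:\s*"(.*)"\s*$', text, re.MULTILINE)
    if match is None:
        raise ValueError("no model_checkpoint_path entry in checkpoint file")
    # Drop the directory recorded on the training machine, keep the file name.
    return os.path.join(save_dir, os.path.basename(match.group(1)))
```

With such a helper, the restore call from the question could become saver.restore(sess, local_checkpoint_prefix(save_path)), which avoids tf.train.latest_checkpoint entirely.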

saver = tf.train.Saver(save_relative_paths=True)

Answer 1 (score: 0)

Open the checkpoint file in your favorite text editor and simply change the absolute paths in it to just the file names.
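
The same edit can be scripted if there are many checkpoint directories to fix. The following is a minimal sketch, assuming every path in the checkpoint file appears as a double-quoted string after a key such as model_checkpoint_path or all_model_checkpoint_paths. The function name make_checkpoint_paths_relative is made up for this example.

```python
import os
import re


def make_checkpoint_paths_relative(save_dir):
    """Rewrite every quoted path in save_dir/checkpoint to its file name.

    Hypothetical helper (not a TensorFlow API). Assumes each entry in the
    file has the form  key: "path".
    """
    checkpoint_file = os.path.join(save_dir, "checkpoint")
    with open(checkpoint_file) as f:
        text = f.read()
    # Turn "/old/dir/model-1000" into "model-1000", keeping the quotes.
    fixed = re.sub(
        r'"([^"]*)"',
        lambda m: '"%s"' % os.path.basename(m.group(1)),
        text,
    )
    with open(checkpoint_file, "w") as f:
        f.write(fixed)
    return fixed
```

After the rewrite, tf.train.latest_checkpoint(save_path) should resolve the bare file names against save_path, since relative entries are interpreted relative to the checkpoint directory. It is worth keeping a backup of the original file before overwriting it.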