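I generated a saved checkpoint from graph code in a regular, non-distributed setup, with the model built under the constraint with tf.device('/cpu:0'): (which forces the model parameters to reside on the CPU rather than the GPU).
I have now converted the same code/graph to a distributed setup, following the guide in TF-Inception.
When I try to restore the checkpoint in the distributed setup, I get a device mismatch error. Is there a way to override the device requirements saved in the checkpoint file?
My new distributed code defines the Saver and its scope as: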
if FLAGS.job_name == 'worker':
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_id,
            cluster=cluster_spec)):
        # ...same network-graph code... #
        restorer = tf.train.Saver()
        with tf.Session() as sess:
            restorer.restore(sess, 'ResNet-L50.ckpt')
My cluster has one ps and one worker, both on localhost. The error line:
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/restore_slice_268/shape_and_slice': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0
[[Node: save/restore_slice_268/shape_and_slice = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: >, _device="/job:ps/task:0/device:CPU:0"]()]]
Full error trace:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K2200, pci bus id: 0000:01:00.0)
Traceback (most recent call last):
File "dlaunch.py", line 85, in <module>
tf.app.run() # (tf.app.flags parsed here)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "dlaunch.py", line 81, in main
dtrainer.train(server.target, cluster_spec)
File "/home/muneeb/parkingtf/dtrainer.py", line 88, in train
restorer.restore(sess, 'ResNet-L50.ckpt')
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1103, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 328, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 563, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 636, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 658, in _do_call
e.code)
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/restore_slice_268/shape_and_slice': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0
[[Node: save/restore_slice_268/shape_and_slice = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: >, _device="/job:ps/task:0/device:CPU:0"]()]]
Caused by op u'save/restore_slice_268/shape_and_slice', defined at:
File "dlaunch.py", line 85, in <module>
tf.app.run() # (tf.app.flags parsed here)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "dlaunch.py", line 81, in main
dtrainer.train(server.target, cluster_spec)
File "/home/muneeb/parkingtf/dtrainer.py", line 86, in train
restorer = tf.train.Saver()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 845, in __init__
restore_sequentially=restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 515, in build
filename_tensor, vars_to_save, restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 271, in _AddRestoreOps
values = self.restore_op(filename_tensor, vs, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 186, in restore_op
preferred_shard=preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 201, in _restore_slice
preferred_shard, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 271, in _restore_slice
preferred_shard=preferred_shard, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 444, in apply_op
as_ref=input_arg.is_ref)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 566, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/constant_op.py", line 179, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/constant_op.py", line 166, in constant
attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2162, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1161, in __init__
self._traceback = _extract_stack()
Answer 0 (score: 2)
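The following line:
with tf.Session() as sess:
...is responsible for the error. Passing no arguments to tf.Session() creates an in-process session, which can only use devices on the local machine. To work in distributed mode, you should have something like the following: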
# Assuming you created `server = tf.train.Server(...)` earlier.
with tf.Session(server.target) as sess:
...or, if you are connecting to a server in a different process:
# Assuming your server is in a different process.
with tf.Session("grpc://..."):
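To see why the in-process session rejects the placement, here is a minimal pure-Python sketch of matching a requested device string against the devices listed in the error message. It is an illustration only, not TensorFlow's actual placer: `parse_device`, `satisfies`, and `matching_devices` are hypothetical helper names, and the matching rule (every field named in the request must equal the same field of the available device) is a simplifying assumption.

```python
# Illustrative sketch only: a simplified stand-in for device matching,
# NOT TensorFlow's real placement algorithm. All names are hypothetical.

def parse_device(spec):
    """Parse e.g. '/job:ps/task:0/device:CPU:0' into a dict of fields."""
    fields = {}
    for part in spec.strip("/").split("/"):
        if not part:
            continue
        pieces = part.lower().split(":")
        if pieces[0] == "device":
            # Long form, e.g. device:CPU:0
            fields["type"], fields["index"] = pieces[1], pieces[2]
        elif pieces[0] in ("cpu", "gpu"):
            # Short form, e.g. cpu:0
            fields["type"], fields["index"] = pieces[0], pieces[1]
        else:
            # job:<name>, replica:<n>, task:<n>
            fields[pieces[0]] = pieces[1]
    return fields


def satisfies(requested, available):
    """True if every field named in `requested` matches `available`."""
    req, avail = parse_device(requested), parse_device(available)
    return all(avail.get(key) == value for key, value in req.items())


# The devices reported as available in the error message above.
LOCAL_DEVICES = [
    "/job:localhost/replica:0/task:0/cpu:0",
    "/job:localhost/replica:0/task:0/gpu:0",
]


def matching_devices(requested, devices=LOCAL_DEVICES):
    """Return the available devices that satisfy the requested spec."""
    return [d for d in devices if satisfies(requested, d)]


if __name__ == "__main__":
    # The spec from the error: the job is 'ps', but only 'localhost' exists.
    print(matching_devices("/job:ps/task:0/device:CPU:0"))
    # An unqualified CPU request can be satisfied locally.
    print(matching_devices("/cpu:0"))
```

Under these assumptions, the request for `/job:ps/task:0/device:CPU:0` matches nothing, because the in-process session only knows about devices in `/job:localhost`. A session created with a server target can see the `ps` and `worker` jobs of the cluster instead.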
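Note that devices are not stored in the checkpoint file; it is tf.train.replica_device_setter() that adds them to the graph. Device configuration is a bit tricky right now, and we are working on simplifying it.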
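As a rough illustration of where the `/job:ps/...` device in the error comes from, the sketch below mimics the general idea of a replica device setter: variable ops are assigned to parameter-server tasks in round-robin order, and every other op goes to the worker device. This is a toy model under stated assumptions, not the TensorFlow implementation; `make_device_setter` and its parameters are hypothetical names, and the real function handles more cases (for example, merging with device constraints that are already set).

```python
import itertools

# Toy model of a replica device setter; NOT TensorFlow code.
# `make_device_setter` and its arguments are hypothetical.


def make_device_setter(ps_tasks, worker_device, ps_op_types=("Variable",)):
    """Return a function mapping an op type to a device string.

    Ops whose type is in `ps_op_types` are spread over the parameter-server
    tasks in round-robin order; all other ops go to `worker_device`.
    """
    ps_cycle = itertools.cycle(range(ps_tasks))

    def choose(op_type):
        if op_type in ps_op_types:
            return "/job:ps/task:%d" % next(ps_cycle)
        return worker_device

    return choose


if __name__ == "__main__":
    # One ps and one worker, as in the question's cluster.
    chooser = make_device_setter(ps_tasks=1, worker_device="/job:worker/task:0")
    for op_type in ("Variable", "MatMul", "Variable"):
        print(op_type, "->", chooser(op_type))
```

In this model the device names are produced while the graph is being built, which is consistent with the point above: the checkpoint only holds variable values, and the placement comes from the graph-construction code.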