TensorFlow: restoring checkpoint variables in a distributed setting

Asked: 2016-05-19 04:19:43

Tags: tensorflow deep-learning

I generated a saved checkpoint from graph code in a regular, non-distributed setting, under a with tf.device('/cpu:0'): constraint (forcing the model parameters to live on the CPU rather than the GPU). I have now converted the same code/graph to a distributed setting, following the guide in TF-Inception. When I try to restore the checkpoint in the distributed setting, I get a device-mismatch error. Is there a way to override the device requirements that seem to be saved in the checkpoint file? My new distributed code defines the Saver and the scope as follows:

if FLAGS.job_name == 'worker':
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_id,
            cluster=cluster_spec)):
        # ...same network-graph code... #
        restorer = tf.train.Saver()
        with tf.Session() as sess:
            restorer.restore(sess, 'ResNet-L50.ckpt')
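
For reference, the cluster and server are created roughly like this (a minimal sketch; the ports and flag definitions below are illustrative placeholders, not the exact launcher code):

import tensorflow as tf

# Illustrative flag definitions (the real launcher parses these in tf.app.run()).
FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('job_name', 'worker', 'Either "ps" or "worker".')
tf.app.flags.DEFINE_integer('task_id', 0, 'Index of the task within its job.')

# One ps task and one worker task, both on localhost (ports are placeholders).
cluster_spec = tf.train.ClusterSpec({
    'ps':     ['localhost:2222'],
    'worker': ['localhost:2223'],
})
server = tf.train.Server(cluster_spec,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_id)

if FLAGS.job_name == 'ps':
    server.join()  # The parameter server blocks here and serves variables.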

My cluster has one ps and one worker, both on localhost. The key error line:

tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/restore_slice_268/shape_and_slice': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0
     [[Node: save/restore_slice_268/shape_and_slice = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: >, _device="/job:ps/task:0/device:CPU:0"]()]]

Full error traceback:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K2200, pci bus id: 0000:01:00.0)
Traceback (most recent call last):
  File "dlaunch.py", line 85, in <module>
    tf.app.run()      # (tf.app.flags parsed here)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "dlaunch.py", line 81, in main
    dtrainer.train(server.target, cluster_spec)
  File "/home/muneeb/parkingtf/dtrainer.py", line 88, in train
    restorer.restore(sess, 'ResNet-L50.ckpt')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1103, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 328, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 563, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 636, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 658, in _do_call
    e.code)
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/restore_slice_268/shape_and_slice': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0
     [[Node: save/restore_slice_268/shape_and_slice = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: >, _device="/job:ps/task:0/device:CPU:0"]()]]
Caused by op u'save/restore_slice_268/shape_and_slice', defined at:
  File "dlaunch.py", line 85, in <module>
    tf.app.run()      # (tf.app.flags parsed here)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "dlaunch.py", line 81, in main
    dtrainer.train(server.target, cluster_spec)
  File "/home/muneeb/parkingtf/dtrainer.py", line 86, in train
    restorer = tf.train.Saver()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 845, in __init__
    restore_sequentially=restore_sequentially)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 515, in build
    filename_tensor, vars_to_save, restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 271, in _AddRestoreOps
    values = self.restore_op(filename_tensor, vs, preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 186, in restore_op
    preferred_shard=preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 201, in _restore_slice
    preferred_shard, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 271, in _restore_slice
    preferred_shard=preferred_shard, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 444, in apply_op
    as_ref=input_arg.is_ref)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 566, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/constant_op.py", line 179, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/constant_op.py", line 166, in constant
    attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2162, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1161, in __init__
    self._traceback = _extract_stack()

1 Answer:

Answer (score: 2):

The following line:

with tf.Session() as sess:

...is responsible for the error. Passing no arguments to tf.Session() creates an in-process session that can only use the devices on the local machine. To work in distributed mode, you should have something like the following:

# Assuming you created `server = tf.train.Server(...)` earlier.
with tf.Session(server.target) as sess:

...or, if you are connecting to a server in a different process:

# Assuming your server is in a different process.
with tf.Session("grpc://..."):

Note that the devices are not stored in the checkpoint file; it is tf.train.replica_device_setter() that adds them to the graph. Device configuration is a bit tricky at the moment, and we are working on simplifying it.
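
Putting these together, the restore in the question could look roughly like this (a sketch, assuming server and cluster_spec were created as in the launcher; replica_device_setter pins the variables to /job:ps, and a session connected to server.target can see that device):

if FLAGS.job_name == 'worker':
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_id,
            cluster=cluster_spec)):
        # ...same network-graph code...
        # Variables (and the Saver's restore ops) are assigned to /job:ps here.
        restorer = tf.train.Saver()
    # Connect to the in-process server rather than creating a local-only
    # session, so that /job:ps and /job:worker devices are both visible.
    with tf.Session(server.target) as sess:
        restorer.restore(sess, 'ResNet-L50.ckpt')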