Distributed TensorFlow errors

Date: 2016-07-19 20:23:40

Tags: tensorflow, distributed

When running distributed TensorFlow (TF v0.9.0rc0), I start 3 parameter servers and then 6 workers. The parameter servers seem fine, each giving the message Started server with target: grpc://localhost:2222, but the workers raise other errors (below) that I have questions about.
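For reference, the 3-ps / 6-worker layout described above corresponds to a cluster definition like the following. The host addresses are placeholders (the real ones are on the 192.168.xx.xx subnet shown in the logs); in actual TF 0.9 code this dict would be passed to tf.train.ClusterSpec and then to tf.train.Server.

```python
# Hypothetical host lists matching the 3-ps / 6-worker layout above.
# Addresses are placeholders, not the machines from the original logs.
ps_hosts = ["192.168.0.{}:2222".format(i) for i in range(1, 4)]
worker_hosts = ["192.168.0.{}:2222".format(i) for i in range(4, 10)]

# The job-name -> host-list dict shape that tf.train.ClusterSpec accepts:
#   cluster_spec = tf.train.ClusterSpec(cluster)
#   server = tf.train.Server(cluster_spec, job_name="ps", task_index=0)
cluster = {"ps": ps_hosts, "worker": worker_hosts}
```

Every process in the cluster must be started with the same dict, differing only in its own job name and task index.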

It looks to me as though sometimes the machines fail to communicate with one another, causing the socket error, connection refused errors, and as though the workers cannot find the parameter servers when initializing variables, causing the Cannot assign a device errors.

Can anyone help me understand what each of these errors means, how big a deal each one is, and perhaps point out how to fix them if needed?

Specifically:

  1. Why am I getting socket errors?
  2. Why are there Master init: Unavailable problems? What does that mean?
  3. How can I ensure that the requested devices are available?
  4. Does this look like something I should post to the issues page of the tensorflow GitHub account?
  5. Setup notes:

    • All machines report TensorFlow version 0.9.0rc0 (python -c "import tensorflow as tf; print(tf.__version__);"), though some may have been installed from source rather than from the pip package, if that matters.
    • All machines are on the same 1Gb Ethernet switch.
    • The hardware is roughly identical, with some workers running dual GPUs.

    All of them produce this error (IP addresses changed):

    E0719 12:06:17.711635677    2543 tcp_client_posix.c:173]  
     failed to connect to 'ipv4:192.168.xx.xx:2222': socket error: connection refused
    

    But all the non-chief workers also give:

    E tensorflow/core/distributed_runtime/master.cc:202] Master init: Unavailable: 
    

    In addition, some of the non-chief workers crash with this error:

    Traceback (most recent call last):  
        File "main.py", line 219, in <module>  
            r.main()  
        File "main.py", line 119, in main  
            with sv.prepare_or_wait_for_session(server.target, config=tf.ConfigProto(gpu_options=gpu_options)) as sess:  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 691, in prepare_or_wait_for_session
            max_wait_secs=max_wait_secs)  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 282, in wait_for_session  
            sess.run([self._local_init_op])  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 372, in run
            run_metadata_ptr)  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 636, in _run  
            feed_dict_string, options, run_metadata)  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 708, in _do_run  
            target_list, options, run_metadata)  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 728, in _do_call  
            raise type(e)(node_def, op, message)  
        tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/restore_slice_23':
            Could not satisfy explicit device specification '/job:ps/task:3/device:CPU:0'
            because no devices matching that specification are registered in this process; available devices: 
                /job:ps/replica:0/task:0/cpu:0,
                /job:ps/replica:0/task:1/cpu:0,
                /job:ps/replica:0/task:2/cpu:0,
                /job:ps/replica:0/task:4/cpu:0,
                /job:worker/replica:0/task:0/cpu:0,
                /job:worker/replica:0/task:0/gpu:0,
                /job:worker/replica:0/task:1/cpu:0,
                /job:worker/replica:0/task:1/gpu:0,
                /job:worker/replica:0/task:2/cpu:0,
                /job:worker/replica:0/task:2/gpu:0 
    [[Node: save/restore_slice_23 = RestoreSlice[dt=DT_FLOAT, preferred_shard=-1, _device="/job:ps/task:3/device:CPU:0"](save/Const, save/restore_slice_23/tensor_name, save/restore_slice_23/shape_and_slice)]]
    Caused by op u'save/restore_slice_23', defined at:  
        File "main.py", line 219, in <module>  
            r.main()  
        File "main.py", line 101, in main  
            saver = tf.train.Saver()  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 845, in __init__  
            restore_sequentially=restore_sequentially)  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 515, in build  
            filename_tensor, vars_to_save, restore_sequentially, reshape)  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 271, in _AddRestoreOps  
            values = self.restore_op(filename_tensor, vs, preferred_shard)  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 186, in restore_op
            preferred_shard=preferred_shard)  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 202, in _restore_slice  
            preferred_shard, name=name)  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 358, in _restore_slice  
            preferred_shard=preferred_shard, name=name)  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op  
            op_def=op_def)  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2260, in create_op  
            original_op=self._default_original_op, op_def=op_def)  
        File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1230, in __init__  
            self._traceback = _extract_stack()
    

1 Answer:

Answer (score: 1)

I figured out what my problem was.

TL;DR: The chief must know about all of the variables in order to initialize all of them. Non-chief workers cannot create their own variables.

I was converting an old program, in which every worker had some independent variables but needed to share a few (I was using ZMQ to pass these), over to distributed TensorFlow, and forgot to have the chief initialize the variables for all the workers. I had something like
# Create worker specific variable
with tf.variable_scope("world_{}".format(worker_id)):
    w1 = tf.get_variable("weight", shape=(input_dim, hidden_dim), dtype=tf.float32, initializer=tf.truncated_normal_initializer())

instead of doing something like this:

# Create all worker specific variables
all_w1 = {}
for worker in range(worker_cnt):
    with tf.variable_scope("world_{}".format(worker)):
        all_w1[worker] = tf.get_variable("weight", shape=(input_dim, hidden_dim), dtype=tf.float32, initializer=tf.truncated_normal_initializer())

# grab worker specific variable
w1 = all_w1[worker_id]
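The difference between the two snippets can be sketched without TensorFlow: in the broken version each worker registers only its own slot, so the chief's graph never contains the other workers' variables. A minimal stand-in (names and shapes are illustrative):

```python
# Plain-Python sketch of the pattern above (no TensorFlow needed): the
# chief creates every worker's slot up front, so any worker can later
# find its own variable by name.

def build_all(worker_cnt, input_dim, hidden_dim):
    """Chief-style setup: register every worker's variable up front."""
    all_w1 = {}
    for worker in range(worker_cnt):
        # Mirrors tf.variable_scope("world_{}") / tf.get_variable("weight").
        all_w1["world_{}/weight".format(worker)] = (input_dim, hidden_dim)
    return all_w1

registry = build_all(6, 128, 64)

# A worker grabs its own variable by name. In the broken version this
# lookup has nothing to find for workers the chief never registered.
w1_shape = registry["world_3/weight"]
```

The key point is that the full set of names exists before any individual worker asks for its own, which is exactly what the Supervisor needs in order to initialize everything.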

As for the errors...

I suspect this caused some of the workers to die with the Master init: Unavailable: error message above, because the chief never knew about the variables the workers wanted to create.

I don't have a solid explanation for why the device-unavailable (3rd) error couldn't find that device, but I think it's again because only the chief can create variables, and he didn't know about the new ones.
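Consistent with that reading, the device list in the traceback above contains ps tasks 0, 1, 2 and 4 but no task 3 — exactly the task the Saver's explicit placement asked for. A small standalone check against the lists copied from the error message:

```python
# The device spec the Saver requested, and the devices the error listed
# as available (copied from the traceback above).
requested = "/job:ps/task:3/device:CPU:0"
available = [
    "/job:ps/replica:0/task:{}/cpu:0".format(t) for t in (0, 1, 2, 4)
] + [
    "/job:worker/replica:0/task:{}/{}:0".format(t, d)
    for t in (0, 1, 2) for d in ("cpu", "gpu")
]

def ps_tasks(devices):
    """Collect the ps task ids present in a device list."""
    return sorted(
        int(d.split("/task:")[1].split("/")[0])
        for d in devices if "/job:ps/" in d
    )

# ps task 3 never registered, so the explicit placement cannot be satisfied.
missing = 3 not in ps_tasks(available)
```

This kind of check is only a diagnostic sketch, but it makes the "no devices matching that specification" message concrete: the process the variable was pinned to simply wasn't part of the session's cluster.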

The first error seems to come from machines not being ready to talk after a failure, since I don't see it after the fix. I still see it if I kill a worker and restart him, but it doesn't seem to be a problem if they all start together.
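One way to soften that startup race is to poll a server's port before launching the processes that depend on it. This is a generic stdlib sketch, not something distributed TensorFlow provides:

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP server accepts connections, or raise TimeoutError.

    Useful before starting workers that depend on a ps server being up;
    "connection refused" simply means nothing is listening yet.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() > deadline:
                raise TimeoutError(
                    "no server on {}:{} after {}s".format(host, port, timeout))
            time.sleep(interval)
```

A launcher script could call wait_for_port on each ps address before starting the worker processes, which avoids the refused-connection spam when machines come up in the wrong order.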

Anyway, I hope this helps if anyone runs into the same errors later.