Distributed TensorFlow: CreateSession still waiting for response from worker: /job:ps/replica:0/task:0

Time: 2017-03-23 05:21:18

Tags: tensorflow

I am trying out the example provided here: https://github.com/ischlag/distributed-tensorflow-example. I have two machines: one acting as the server and the other as a worker. (Both machines run version 1.0.1.)

I get the following error:

    Variables initialized ...
    I tensorflow/core/distributed_runtime/master.cc:193] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
    I tensorflow/core/distributed_runtime/master.cc:193] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
    I tensorflow/core/distributed_runtime/master.cc:193] CreateSession still waiting for response from worker: /job:worker/replica:0/task:2

2 Answers:

Answer 0: (score: 0)

I had a similar issue that I was able to fix by adding a third node as a master to the ClusterSpec. My TF_CONFIG environment variable looks something like:

    TF_CONFIG = {
        'cluster': {
            'master': ['master_node01:2222'],
            'ps': ['ps_node01:2222', ...],
            'worker': ['worker_node01:2222', ...]},
        'environment': 'cloud',
        'task': {'type': current_task, 'index': current_index}}
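
For what it's worth, TF_CONFIG is read as a JSON string from the environment, so one way to produce it is to build the dictionary in Python and serialize it. This is only a minimal sketch: the helper name `build_tf_config` and the extra host `worker_node02` are made up for illustration, and the host lists stand in for whatever your cluster actually contains.

```python
import json
import os


def build_tf_config(cluster, task_type, task_index):
    """Return the TF_CONFIG JSON string for one task (hypothetical helper)."""
    if task_type not in cluster:
        raise ValueError("unknown job name: %r" % task_type)
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(
            "task index %d is out of range for job %r" % (task_index, task_type))
    return json.dumps({
        "cluster": cluster,
        "environment": "cloud",
        "task": {"type": task_type, "index": task_index},
    })


# Placeholder host names; replace them with the real addresses of your nodes.
cluster = {
    "master": ["master_node01:2222"],
    "ps": ["ps_node01:2222"],
    "worker": ["worker_node01:2222", "worker_node02:2222"],
}

# Every node gets the same "cluster" section; only the "task" section differs.
os.environ["TF_CONFIG"] = build_tf_config(cluster, "worker", 1)
print(os.environ["TF_CONFIG"])
```

The range check is there because a task index that points past the end of its job's list is an easy mistake to make when the file is written by hand.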

Answer 1: (score: 0)

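I ran into the same problem, and after several hours of debugging I found that it was caused by the entries in cluster_spec being in the wrong order: the task_index did not match the position of the host in the ps/worker lists. After I fixed the order, the problem went away.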
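
To make the ordering point concrete, here is a rough sketch of a check that can be run before starting the servers. It uses plain dictionaries rather than TensorFlow itself; in TensorFlow 1.x the same dictionary would be handed to `tf.train.ClusterSpec`. The function names and IP addresses are hypothetical. The idea is that task_index is just a position in the job's host list, so every machine has to list the hosts in the same order.

```python
def address_for(cluster_spec, job_name, task_index):
    """Return the host:port that a given job name and task index resolve to."""
    hosts = cluster_spec[job_name]
    if not 0 <= task_index < len(hosts):
        raise IndexError(
            "task_index %d is out of range for job %r" % (task_index, job_name))
    return hosts[task_index]


def find_order_mismatches(spec_a, spec_b):
    """List (job, index, addr_a, addr_b) for every task the two specs disagree on."""
    mismatches = []
    for job in sorted(set(spec_a) | set(spec_b)):
        hosts_a = spec_a.get(job, [])
        hosts_b = spec_b.get(job, [])
        for i in range(max(len(hosts_a), len(hosts_b))):
            a = hosts_a[i] if i < len(hosts_a) else None
            b = hosts_b[i] if i < len(hosts_b) else None
            if a != b:
                mismatches.append((job, i, a, b))
    return mismatches


# Hypothetical addresses: the two machines list the workers in a different order.
spec_on_ps = {
    "ps": ["10.0.0.1:2222"],
    "worker": ["10.0.0.2:2222", "10.0.0.3:2222"],
}
spec_on_worker = {
    "ps": ["10.0.0.1:2222"],
    "worker": ["10.0.0.3:2222", "10.0.0.2:2222"],
}

mismatches = find_order_mismatches(spec_on_ps, spec_on_worker)
for job, index, a, b in mismatches:
    print("/job:%s/task:%d resolves to %s on one machine but %s on the other"
          % (job, index, a, b))
```

If the check reports any mismatch, a task may be waiting on an address where a different task (or nothing) is listening, which is consistent with the "CreateSession still waiting" message in the question.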