Distributed TensorFlow: CreateSession still waiting for a different node

Time: 2017-10-10 03:37:59

Tags: python tensorflow distributed grpc

I am trying to get the mnist_replica.py example to work. Following the suggestion in {{3}}, I am specifying device filters.
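For context, a minimal sketch of what those device filters look like (an assumption based on the session config printed in the worker log below, not the exact mnist_replica.py source): each task only lists the remote devices it actually needs, so a worker waits on the ps job and itself rather than on every task in the cluster.

```python
# Sketch: build the device-filter list for a ps or worker task.
# (Hypothetical helper; mnist_replica.py inlines this logic.)

def build_device_filters(job_name, task_index):
    """Return the device filters a task should pass in its session config."""
    if job_name == "ps":
        # A ps task only needs to see the ps job; workers connect to it.
        return ["/job:ps"]
    # A worker needs the ps job plus its own task, nothing else.
    return ["/job:ps", "/job:worker/task:%d" % task_index]

filters = build_device_filters("worker", 0)
# In TensorFlow 1.x this list is passed as
#   tf.ConfigProto(device_filters=filters, allow_soft_placement=True)
# which matches the "device_filters: ..." line in the worker log below.
```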

My code works fine when the ps and worker tasks are on the same node. When I try placing the ps task on node1 and the worker task on node2, I get "CreateSession still waiting".

For example:

Pseudo-distributed version (works!)

Terminal dump of Node1 (instance 1)

node1 $ python mnist_replica.py --worker_hosts=node1:2223 --job_name=ps --task_index=0
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
job name = ps
task index = 0
2017-10-10 11:09:16.637006: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2017-10-10 11:09:16.637075: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> node1:2223}
2017-10-10 11:09:16.640114: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:2222
...

Terminal dump of Node1 (instance 2)

node1 $ python mnist_replica.py --worker_hosts=node1:2223 --job_name=worker --task_index=0
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
job name = worker
task index = 0
2017-10-10 11:11:12.784982: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2017-10-10 11:11:12.785046: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2223}
2017-10-10 11:11:12.787685: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:2223
Worker 0: Initializing session...
2017-10-10 11:11:12.991784: I tensorflow/core/distributed_runtime/master_session.cc:998] Start master session 418af3aa5ce103a3 with config: device_filters: "/job:ps" device_filters: "/job:worker/task:0" allow_soft_placement: true
Worker 0: Session initialization complete.
Training begins @ 1507648273.272837
1507648273.443305: Worker 0: training step 1 done (global step: 0)
1507648273.454537: Worker 0: training step 2 done (global step: 1)
...

Distributed across 2 nodes (does not work)

Terminal dump of Node1

node1 $ python mnist_replica.py --worker_hosts=node2:2222 --job_name=ps --task_index=0
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
job name = ps
task index = 0
2017-10-10 10:54:27.419949: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2017-10-10 10:54:27.420064: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> node2:2222}
2017-10-10 10:54:27.426168: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:2222
...

Terminal dump of Node2

node2 $ python mnist_replica.py --ps_hosts=node1:2222 --worker_hosts=node2:2222 --job_name=worker --task_index=0
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
job name = worker
task index = 0
2017-10-10 10:51:13.303021: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> node1:2222}
2017-10-10 10:51:13.303081: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222}
2017-10-10 10:51:13.308288: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:2222
Worker 0: Initializing session...
2017-10-10 10:51:23.508040: I tensorflow/core/distributed_runtime/master.cc:209] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2017-10-10 10:51:33.508247: I tensorflow/core/distributed_runtime/master.cc:209] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
...

Both nodes run CentOS 7, TensorFlow r1.3, Python 2.7. The nodes can reach each other via ssh, the hostnames are correct, and the firewall is disabled. What am I missing?

What additional steps do I need to take to ensure the nodes can communicate with each other over gRPC? Thanks.

2 answers:

Answer 0 (score: 0)

I think you should double-check the ClusterSpec and server setup. For example, verify the IP addresses of node1 and node2, and check the ports and task indices. I would like to give more specific advice, but that is hard to do without seeing the code. Thanks.
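To make that advice concrete, here is a small sanity check (hypothetical helper, illustrative names only): every task must be started with the *same* ps/worker host lists, and the task's own `(job_name, task_index)` must resolve to a valid entry in those lists — note that in the failing run above, node1 was started without `--ps_hosts` at all.

```python
# Validate that a task's job name and index fit the cluster definition.
# `cluster` mirrors the dict you would pass to tf.train.ClusterSpec.

def check_cluster(cluster, job_name, task_index):
    """Return the host:port this task should bind, or raise ValueError."""
    if job_name not in cluster:
        raise ValueError("unknown job: %s" % job_name)
    hosts = cluster[job_name]
    if not 0 <= task_index < len(hosts):
        raise ValueError("task_index %d out of range for job %s (%d hosts)"
                         % (task_index, job_name, len(hosts)))
    return hosts[task_index]

# The cluster from the question: ps on node1, worker on node2.
cluster = {"ps": ["node1:2222"], "worker": ["node2:2222"]}
addr = check_cluster(cluster, "worker", 0)  # "node2:2222"
```

Run this with the exact flag values you pass on each node; if any node raises, the ClusterSpec is inconsistent across the cluster.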

Answer 1 (score: 0)

The problem was that the firewall was blocking the port. I disabled the firewall on all the nodes involved and the problem was solved!
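A quick way to confirm this kind of issue before blaming gRPC: attempt a raw TCP connection from each node to every other task's port. A minimal sketch (host and port below are from the question; adapt as needed):

```python
# Check whether a remote task's port is reachable over plain TCP.
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. run on node2 to verify the ps task on node1 is reachable:
# print(port_open("node1", 2222))
```

If this returns False while the server process is running, a firewall or binding issue is the likely cause; on CentOS 7 the port can be opened with firewalld instead of disabling it entirely, e.g. `firewall-cmd --add-port=2222/tcp --permanent` followed by `firewall-cmd --reload`.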