I am using 2 Ubuntu servers to run distributed TensorFlow. Each server has TensorFlow 0.8.0 installed.
I first start the ps server on server1:

```
ubuntu@i-mxdcqm20:/data1T5/org_models/inception$ sudo bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
--task_id=0 \
--ps_hosts='43.254.55.221:2222' \
--worker_hosts='61.160.41.85:2222'
```
The log shows:
INFO:tensorflow:PS hosts are: ['43.254.55.221:2222']
INFO:tensorflow:Worker hosts are: ['61.160.41.85:2222']
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {localhost:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {61.160.41.85:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2222
When I run `sudo netstat -tunlp`, the server is indeed listening on port 2222:
tcp6 0 0 :::2222 :::* LISTEN 3525/python
But when I start the worker on server2, it still reports a connection failure:
E0722 10:35:01.142377237 4045 tcp_client_posix.c:191] failed to connect to 'ipv4:43.254.55.221:2222': timeout occurred
I ran the code by following the README and did not change any of it.
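
One way to tell a network problem apart from a TensorFlow problem is to try the same TCP connection outside TensorFlow. The following is a minimal sketch, assuming Python 3 is available on server2. The helper name `can_connect` and the 3-second timeout are illustrative choices, not part of TensorFlow or the Inception code.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a plain TCP connection to (host, port) succeeds.

    Hypothetical helper for diagnosis only; it opens a socket and
    closes it immediately without speaking gRPC.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False
```

Calling `can_connect("43.254.55.221", 2222)` on server2 while the ps server is running would show whether the port is reachable at all. If it returns `False` even though `netstat` shows the port listening on server1, the likely cause sits between the two machines (for example a firewall or cloud security rule, or a public IP that does not route to that host) rather than in the TensorFlow flags.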