Cannot run the distributed ImageNet Inception model (connection failed)

Time: 2016-07-22 04:01:57

Tags: tensorflow

I am using two Ubuntu servers to run distributed TensorFlow. Each server has TensorFlow 0.8.0 installed.

I first start the ps server on server1:

```
ubuntu@i-mxdcqm20:/data1T5/org_models/inception$ sudo bazel-bin/inception/imagenet_distributed_train \
  --job_name='ps' \
  --task_id=0 \
  --ps_hosts='43.254.55.221:2222' \
  --worker_hosts='61.160.41.85:2222'
```

The log shows:

INFO:tensorflow:PS hosts are: ['43.254.55.221:2222']
INFO:tensorflow:Worker hosts are: ['61.160.41.85:2222']
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {localhost:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {61.160.41.85:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2222
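As I understand it, the two host flags describe the cluster as a mapping from job name to a list of `host:port` addresses, which is what the "PS hosts" and "Worker hosts" log lines print. A minimal sketch of that mapping in plain Python, assuming the flag values are comma-separated; the helper name `build_cluster` is hypothetical and is not taken from the training script:

```python
def build_cluster(ps_hosts, worker_hosts):
    """Turn comma-separated host flags into a job -> address-list mapping.

    Hypothetical helper for illustration only. A dict of this shape is
    the kind of input that tf.train.ClusterSpec accepts.
    """
    def split_hosts(value):
        # Drop surrounding whitespace and ignore empty entries.
        return [h.strip() for h in value.split(",") if h.strip()]

    return {"ps": split_hosts(ps_hosts), "worker": split_hosts(worker_hosts)}


# Same addresses as in the command I ran.
cluster = build_cluster("43.254.55.221:2222", "61.160.41.85:2222")
print(cluster)
```

Both servers have to be started with the same `--ps_hosts` and `--worker_hosts` values so that they agree on this mapping.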

When I run `sudo netstat -tunlp`, the server is actually listening on port 2222:

tcp6 0 0 :::2222 :::* LISTEN 3525/python

But when I start the worker on server2, it still reports a connection failure:

E0722 10:35:01.142377237 4045 tcp_client_posix.c:191] failed to connect to 'ipv4:43.254.55.221:2222': timeout occurred
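To separate a plain network problem from a TensorFlow problem, the same TCP connection can be attempted without TensorFlow. The following is only a diagnostic sketch using the Python standard library; the function name `can_connect` and the 5-second timeout are my own choices, and the demonstration at the end talks to a throwaway local listener rather than to the real ps host:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Self-check against a temporary listener on the loopback interface.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
local_port = listener.getsockname()[1]
print(can_connect("127.0.0.1", local_port))
listener.close()
```

On server2 the call of interest would be `can_connect("43.254.55.221", 2222)`. If it also returns `False`, the port is not reachable from server2 at the network level, independent of TensorFlow.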

I followed the README to run the code, and I did not change any of the code.

0 answers:

No answers