Question

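I'm trying this tensorflow distributed tutorial on my own computer, with the same operating system and Python version. I created the first script and ran it in a terminal, then opened another terminal, ran the second script, and got the following error: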

E0629 10:11:01.979187251   15265 tcp_server_posix.c:284]     bind addr=[::]:2222: Address already in use
E0629 10:11:01.979243221   15265 server_chttp2.c:119]        No address added out of total 1 resolved
Traceback (most recent call last):
  File "worker0.py", line 7, in <module>
    task_index=0)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/server_lib.py", line 142, in __init__
    server_def.SerializeToString(), status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors.py", line 450, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors.InternalError: Could not start gRPC server

I get a similar error when trying the official distributed tutorial.

Edit: I tried this on another machine with the same packages, and now I get the following error log:

E0629 11:17:44.500224628   18393 tcp_server_posix.c:284]     bind addr=[::]:2222: Address already in use
E0629 11:17:44.500268362   18393 server_chttp2.c:119]        No address added out of total 1 resolved
Segmentation fault (core dumped)

What could be the problem?

Answer 1

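The problem is probably that you are using the same port number (2222) for both workers. On any given host, each port number can be used by only one process at a time. That is what the error "bind addr=[::]:2222: Address already in use" means.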
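To see the same failure outside TensorFlow, here is a minimal sketch using only Python's standard `socket` module (Python 3 assumed). The helper name `second_bind_errno` is made up for illustration, and the sketch binds to an OS-chosen free port rather than 2222 so that it does not collide with anything already running:

```python
import errno
import socket


def second_bind_errno():
    """Bind one socket to a port, then try to bind a second socket to it.

    Returns the errno of the second bind's failure, or None if it succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS for any free port (an assumption of this
        # sketch; the question uses the fixed port 2222).
        first.bind(("127.0.0.1", 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Same host, same port, while the first socket still holds it.
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        second.close()
        first.close()


if __name__ == "__main__":
    code = second_bind_errno()
    print(code == errno.EADDRINUSE, errno.errorcode.get(code))
```

The second `bind` fails for the same reason the second worker script does: the first process (here, the first socket) still owns the port.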

My guess is that either you have "localhost:2222" twice in your cluster spec, or you have specified the same task_index for both tasks.
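A quick way to check the first possibility is to look for repeated addresses in the cluster dictionary before passing it to `tf.train.ClusterSpec`. The sketch below uses plain Python only; the helper `duplicate_addresses` and the two example dictionaries are hypothetical, since the question does not show the actual cluster spec, and port 2223 is just an arbitrary second port:

```python
from collections import Counter


def duplicate_addresses(cluster):
    """Return the "host:port" strings that appear more than once.

    `cluster` maps job names to lists of task addresses, the same shape
    of dictionary that tf.train.ClusterSpec accepts.
    """
    counts = Counter(addr for tasks in cluster.values() for addr in tasks)
    return sorted(addr for addr, n in counts.items() if n > 1)


# Hypothetical spec that would reproduce the error: both tasks share a port.
broken = {"worker": ["localhost:2222", "localhost:2222"]}

# Hypothetical fix: one distinct port per task on the same host.
fixed = {"worker": ["localhost:2222", "localhost:2223"]}

if __name__ == "__main__":
    print(duplicate_addresses(broken))
    print(duplicate_addresses(fixed))
```

With a spec like `fixed`, the first script would start its server with task_index=0 and the second with task_index=1, so each process binds its own port.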

I hope that helps!

Distributed TensorFlow example doesn't work with TensorFlow 0.9