Breaking change in distributed training from TF v1.3 to v1.4: "UnavailableError: Trying to connect an http1.x server"

时间:2017-11-06 18:02:57

标签: tensorflow grpc

When creating a managed session for distributed training with this line:

with sv.managed_session(server.target, config=config) as sess, sess.as_default():

I get this error on the chief worker (full stack trace below):

tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server

Everything on the parameter server still seems fine, with nothing unusual reported:

E1106 11:26:32.844686639    5543 ev_epoll1_linux.c:1051]     grpc epoll fd: 8
2017-11-06 11:26:32.851773: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:12222}   
2017-11-06 11:26:32.851863: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12223}
2017-11-06 11:26:32.856802: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12222

I only get this error with the new v1.4 of tensorflow built from source (I see the same problem when installing from pip). Everything worked fine in v1.3. Does anyone know whether there was a breaking change, presumably in how tensorflow works with grpc?

I wonder whether this is related to http2 vs http1? I see that gRPC appears to use protobuf over http2, which suggests something is trying to connect over http1 instead, but that still doesn't explain why this shows up when upgrading from v1.3 to v1.4.

Does anyone know what the error

UnavailableError: Trying to connect an http1.x server

refers to, and what a possible workaround might be?

I am working on RedHat Linux and trying to do distributed training between processes on the same localhost... not even trying to go over the network. I would appreciate any ideas, and hopefully this can also help others hitting the same problem.
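For context, the managed_session line above sits in a setup roughly like the minimal sketch below (the job names, ports, and logdir are illustrative reconstructions based on the logs, not my exact worker.py):

# Minimal sketch of the distributed setup (illustrative, not the exact worker.py).
import tensorflow as tf

# One parameter server and one worker, both on this machine.
cluster = tf.train.ClusterSpec({
    "ps": ["127.0.0.1:12222"],
    "worker": ["localhost:12223"],
})

# Each process starts an in-process gRPC server for its own job/task.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# ... build the model graph here ...

sv = tf.train.Supervisor(is_chief=True, logdir="/tmp/train_logs")
config = tf.ConfigProto(allow_soft_placement=True)

# This is the line that raises the UnavailableError under TF v1.4:
with sv.managed_session(server.target, config=config) as sess, sess.as_default():
    pass  # training loop would go here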

Full stack trace:

E1106 11:28:24.383745692    5787 ev_epoll1_linux.c:1051]     grpc epoll fd: 8
2017-11-06 11:28:24.391084: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12222}
2017-11-06 11:28:24.391185: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12223}
2017-11-06 11:28:24.392285: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12223
2017-11-06 11:28:37.875632: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: Trying to connect an http1.x server
Traceback (most recent call last):
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1293, in _run_fn
    self._extend_graph()
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1354, in _extend_graph
    self._session, graph_def.SerializeToString(), status)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/pydevd.py", line 1599, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/pydevd.py", line 1026, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "worker.py", line 426, in <module>
    main()
  File "worker.py", line 418, in main
    run(args, server)
  File "worker.py", line 174, in run
    sess.run(trainer.sync)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server

2 answers:

Answer 0 (score: 1):

If you follow @NoahEisen's suggestion and

export GRPC_VERBOSITY="DEBUG"

you will see more info like the following:

I am behind a proxy, but I am only trying to do distributed training on localhost. For some reason it tries to connect to the IP 127.0.0.1 through the proxy anyway; shouldn't that be equivalent to localhost? i.e. note in particular this part:

E1108 17:37:57.085195825   17711 ev_epoll1_linux.c:1051]     grpc epoll fd: 5
D1108 17:37:57.085309439   17711 ev_posix.c:111]             Using polling engine: epoll1
D1108 17:37:57.085380147   17711 dns_resolver.c:301]         Using native dns resolver
I1108 17:37:57.085819333   17711 socket_utils_common_posix.c:223] Disabling AF_INET6 sockets because ::1 is not available.
I1108 17:37:57.086001584   17711 tcp_server_posix.c:322]     Failed to add :: listener, the environment may not support IPv6: {"created":"@1510180677.085876868","description":"OS Error","errno":97,"file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.c","file_line":256,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::]:12223"}
2017-11-08 17:37:57.092525: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12222}
2017-11-08 17:37:57.092648: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12223}
2017-11-08 17:37:57.093435: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12223
D1108 17:38:02.607109518   17830 http_proxy.c:70]            userinfo found in proxy URI
I1108 17:38:02.611335569   17807 http_connect_handshaker.c:304] Connecting to server 127.0.0.1:12222 via HTTP proxy ipv4:xx.xx.xx.xx:xxxx
2017-11-08 17:38:02.617814: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: Trying to connect an http1.x server

I guess this was laziness in my python code. If I change the ps to "localhost" in the cluster spec instead of the IP 127.0.0.1, everything seems to work again in TF1.4, apparently because it no longer tries to reach localhost through my proxy server (which, in fact, I believe only speaks HTTP1.x).
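In concrete terms the change amounts to something like this (a sketch only; the job names and ports are taken from the logs above):

# Before: TF 1.4 routed the connection to the ps through my HTTP proxy and failed.
cluster = tf.train.ClusterSpec({
    "ps": ["127.0.0.1:12222"],
    "worker": ["localhost:12223"],
})

# After: with "localhost" the connection bypasses the proxy and works again.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:12222"],
    "worker": ["localhost:12223"],
})

Since the proxy appears to be picked up from the environment (http_proxy), unsetting it for the training processes might be an alternative workaround, but I have not verified that.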

@PeteWaren - does this constitute an actual bug in tensorflow or grpc? Shouldn't localhost and 127.0.0.1 be treated as equivalent? Either way, the way they are handled has changed from TF1.3 to TF1.4.

Thanks everyone for your help.

Answer 1 (score: 0):

Answering because I don't have enough reputation to comment (working on it!).

I suspect this is a TensorFlow issue rather than a gRPC issue, but if you want to see more debug information from gRPC, try:

export GRPC_TRACE=all
export GRPC_VERBOSITY=DEBUG
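For example, to enable this for a single run you could prefix the command (a sketch; worker.py here is just the script from the traceback above):

GRPC_TRACE=all GRPC_VERBOSITY=DEBUG python worker.py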