Intermittent connection failures between Docker containers

Time: 2018-02-12 10:44:07

Tags: docker networking

Description

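I am seeing intermittent communication problems between containers attached to the same overlay network. I have been trying to find a solution for several weeks, but nothing I have found on Google about communication problems matches what I am seeing, so I am hoping someone can help me work out what is going on.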

We are using Docker 17.06
We are using standalone swarm with three masters and one node.
We have multiple overlay networks

Containers attached to each overlay network:

1 container running Apache Tomcat 8.5 and HAproxy 1.7 (called the controller)
1 container just running Apache Tomcat 8.5 (called the apps container)
3 containers running Postgresql 9.6
1 container running an FTP service
1 container running Logstash

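Steps to reproduce the issue

Create a new overlay network
Attach the containers
Watch the logs; after a while the errors appear

Describe the results you received

The "controller" polls a servlet on the "apps" container every few seconds. Roughly every 15 minutes we see a connection timeout error in the controller's log file. In addition, when the controller tries to access its database in one of the Postgresql containers, we see failed connection attempts.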

Error when polling the apps container

org.apache.http.conn.ConnectTimeoutException: Connect to srvpln50-webapp_1.0-1:5050 [srvpln50-webapp_1.0-1/10.0.1.6] failed: Connection timed out

Error when trying to connect to the database

JavaException: com.ebasetech.xi.exceptions.FormRuntimeException: Error getting connection using Database Connection CONTROLLER, SQLException in StandardPoolDataSource:getConnection exception: java.sql.SQLException: SQLException in StandardPoolDataSource:getConnection no connection available java.sql.SQLException: Cannot get connection for URL jdbc:postgresql://srvpln50-controller-db_latest:5432/ctrldata : The connection attempt failed.
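
Both errors above come down to a TCP connection attempt that does not complete. One way to timestamp the failures independently of Tomcat and the JDBC pool is a small probe run from inside the controller container. The following is only a minimal sketch, assuming Python 3 is available there; the function names, the 3-second timeout, and the 5-second interval are hypothetical choices, not part of the original setup.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host, port, timeout=3.0):
    """Return (ok, detail) for one TCP connection attempt to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:  # covers timeouts, refusals, and DNS failures
        return False, f"{type(exc).__name__}: {exc}"


def watch(host, port, interval=5.0, attempts=None):
    """Probe repeatedly and print a UTC timestamp for every failure.

    Runs forever when attempts is None; returns the number of failures
    once the requested number of attempts has been made.
    """
    failures = 0
    count = 0
    while attempts is None or count < attempts:
        ok, detail = probe(host, port)
        if not ok:
            failures += 1
            stamp = datetime.now(timezone.utc).isoformat()
            print(f"{stamp} {host}:{port} FAILED {detail}")
        count += 1
        if attempts is None or count < attempts:
            time.sleep(interval)
    return failures
```

Calling watch("srvpln50-webapp_1.0-1", 5050) or watch("srvpln50-controller-db_latest", 5432) uses the host names and ports from the two error messages. The printed timestamps can then be lined up against the daemon's debug log.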

I turned on debug mode for the Docker daemon on the node.

Every time these errors occur, I see the following corresponding entries in the docker logs:

Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.422797691Z" level=debug msg="Name To resolve: srvpln50-webapp_1.0-1."
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.422905040Z" level=debug msg="Lookup for srvpln50-webapp_1.0-1.: IP [10.0.1.6]"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.648262289Z" level=debug msg="miss notification: dest IP 10.0.0.3, dest MAC 02:42:0a:00:00:03"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.716329366Z" level=debug msg="miss notification: dest IP 10.0.0.6, dest MAC 02:42:0a:00:00:06"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.716952000Z" level=debug msg="miss notification: dest IP 10.0.0.6, dest MAC 02:42:0a:00:00:06"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.802320875Z" level=debug msg="miss notification: dest IP 10.0.0.3, dest MAC 02:42:0a:00:00:03"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.944189349Z" level=debug msg="miss notification: dest IP 10.0.0.9, dest MAC 02:42:0a:00:00:09"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.944770233Z" level=debug msg="miss notification: dest IP 10.0.0.9, dest MAC 02:42:0a:00:00:09"

IP 10.0.0.3 is the "controller" container
IP 10.0.0.6 is the "apps" container
IP 10.0.0.9 is the "postgresql" container that the "controller" is trying to connect to.
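
To see whether the "miss notification" entries cluster around the failure times, the debug log can be tallied per destination IP. This is a rough sketch that assumes the log lines have exactly the format shown above; the function names are hypothetical, and the IP-to-container mapping is the one listed in this question.

```python
import re
from collections import Counter

# Matches only the "miss notification" debug messages shown in the log excerpt.
MISS_RE = re.compile(
    r'msg="miss notification: dest IP (?P<ip>[0-9.]+), '
    r'dest MAC (?P<mac>[0-9a-f:]+)"'
)

# Mapping taken from the question; extend it for other containers.
NAMES = {
    "10.0.0.3": "controller",
    "10.0.0.6": "apps",
    "10.0.0.9": "postgresql",
}


def count_misses(lines):
    """Count miss notifications per destination IP."""
    counts = Counter()
    for line in lines:
        match = MISS_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


def summarize(lines):
    """Same counts, keyed by container name where the IP is known."""
    return {NAMES.get(ip, ip): n for ip, n in count_misses(lines).items()}
```

Passing an open log file or sys.stdin as lines works, since both yield one line at a time. Lines that are not miss notifications, such as the name lookups, are ignored.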

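Describe the results you expected

No connection errors

Additional information you deem important (e.g. issue happens only occasionally)

Output of docker version

Client: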

Version: 17.06.1-ce
API version: 1.30
Go version: go1.8.3
Git commit: 874a737
Built: Thu Aug 17 22:51:12 2017
OS/Arch: linux/amd64

Server:

Version: 17.06.1-ce
API version: 1.30 (minimum version 1.12)
Go version: go1.8.3
Git commit: 874a737
Built: Thu Aug 17 22:50:04 2017
OS/Arch: linux/amd64
Experimental: false

Output of docker info

Containers: 19
 Running: 19
 Paused: 0
 Stopped: 0
Images: 18
Server Version: 17.06.1-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 385
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 6e23458c129b551d5c9871e5174f6b1b7f6d1170
runc version: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-108-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.784GiB
Name: swarm-node-1
ID: O5ON:VQE7:IRV6:WCB7:RQO4:RIZ4:XFHE:AUCX:ZLM2:GPZL:DXQO:BCIX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 217
 Goroutines: 371
 System Time: 2018-02-09T15:50:01.902816981Z
 EventsListeners: 2
Registry: https://index.docker.io/v1/
Labels:
 name=swarm-node-1
Experimental: false
Cluster Store: etcd://localhost:2379/store
Cluster Advertise: 10.80.120.13:2376
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

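Additional environment details (AWS, VirtualBox, physical, etc.)

The Swarm masters, the node, and the containers all run on bare-metal servers running Ubuntu 16.04.

If I have left out anything that would help with diagnosis, please let me know.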

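1 Answer:

Answer 0 (score: 0)

After reading many comments from Docker staff, found via Google, about a number of communication problems that were fixed in recent Docker releases, we upgraded to 17.12 CE, and all the problems we were seeing went away.

I would love to know what the underlying problem was, but I am glad to see it gone.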