Docker随机重启所有服务

时间:2019-07-18 14:51:08

标签: docker docker-swarm

我们目前遇到所有Docker群服务(生产中)的间歇性和随机性重启。这种行为是在几天前开始的,到目前为止,我们的服务已重新启动约6次。 当我检查/var/log/messages时,我们在重新启动服务时会看到以下内容

Jul 18 14:27:07 MASTER_SERVER dockerd: time="2019-07-18T14:27:07.777591370+02:00" level=info msg="NetworkDB stats OUR_MASTER_SERVER(1f6c53091b31) - netID:xqo1jd54dy9h4si7p5kd147lf leaving:false netPeers:2 entries:16 Queue qLen:0 netMsg/s:0"
Jul 18 14:27:07 MASTER_SERVER dockerd: time="2019-07-18T14:27:07.778866829+02:00" level=info msg="NetworkDB stats OUR_MASTER_SERVER(1f6c53091b31) - netID:hlxpttt1y61naebb1ehva5oa0 leaving:false netPeers:2 entries:42 Queue qLen:0 netMsg/s:0"
Jul 18 14:27:07 MASTER_SERVER dockerd: time="2019-07-18T14:27:07.778894355+02:00" level=info msg="NetworkDB stats OUR_MASTER_SERVER(1f6c53091b31) - netID:hccgvewpmr4qg93smkf1zb3gk leaving:false netPeers:2 entries:4 Queue qLen:0 netMsg/s:0"
Jul 18 14:28:05 MASTER_SERVER sshd[64861]: rexec line 25: Deprecated option KeyRegenerationInterval
Jul 18 14:28:05 MASTER_SERVER sshd[64861]: rexec line 26: Deprecated option ServerKeyBits
Jul 18 14:28:05 MASTER_SERVER sshd[64861]: rexec line 40: Deprecated option RSAAuthentication
Jul 18 14:28:05 MASTER_SERVER sshd[64861]: rexec line 42: Deprecated option RhostsRSAAuthentication
Jul 18 14:28:05 MASTER_SERVER sshd[64861]: rexec line 61: Deprecated option UseLogin
Jul 18 14:28:37 MASTER_SERVER dockerd: time="2019-07-18T14:28:37.219729414+02:00" level=warning msg="memberlist: Refuting a suspect message (from: b19e7378f796)"
Jul 18 14:28:37 MASTER_SERVER dockerd: time="2019-07-18T14:28:37.242915522+02:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = NotFound desc = node not registered" method="(*session).heartbeat" module=node/agent node.id=0bf0j7rcw6xg7pda5lka7jyy8 session.id=mhf0hqu2wo4t6x1n8ja4tzwr3 sessionID=mhf0hqu2wo4t6x1n8ja4tzwr3
Jul 18 14:28:37 MASTER_SERVER dockerd: time="2019-07-18T14:28:37.377485716+02:00" level=error msg="agent: session failed" backoff=100ms error="node not registered" module=node/agent node.id=0bf0j7rcw6xg7pda5lka7jyy8
Jul 18 14:28:37 MASTER_SERVER dockerd: time="2019-07-18T14:28:37.378175039+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=0bf0j7rcw6xg7pda5lka7jyy8
Jul 18 14:28:37 MASTER_SERVER dockerd: time="2019-07-18T14:28:37.378223959+02:00" level=info msg="waiting 78.699854ms before registering session" module=node/agent node.id=0bf0j7rcw6xg7pda5lka7jyy8
Jul 18 14:28:37 MASTER_SERVER dockerd: time="2019-07-18T14:28:37.392797756+02:00" level=info msg="worker 2ni1yglnxjxlpopn7kwxh6oh5 was successfully registered" method="(*Dispatcher).register"
Jul 18 14:28:37 MASTER_SERVER dockerd: time="2019-07-18T14:28:37.497010348+02:00" level=info msg="worker 0bf0j7rcw6xg7pda5lka7jyy8 was successfully registered" method="(*Dispatcher).register"
Jul 18 14:28:39 MASTER_SERVER kernel: IPVS: rr: FWM 452 0x000001C4 - no destination available
Jul 18 14:28:39 MASTER_SERVER kernel: IPVS: rr: FWM 452 0x000001C4 - no destination available
Jul 18 14:28:39 MASTER_SERVER kernel: IPVS: rr: FWM 451 0x000001C3 - no destination available
Jul 18 14:28:40 MASTER_SERVER NetworkManager[8489]: <info>  [1563452920.8734] manager: (veth33d0b84): new Veth device (/org/freedesktop/NetworkManager/Devices/1919)
Jul 18 14:28:40 MASTER_SERVER NetworkManager[8489]: <info>  [1563452920.8750] manager: (veth1ba455b): new Veth device (/org/freedesktop/NetworkManager/Devices/1920)
Jul 18 14:28:40 MASTER_SERVER kernel: br0: port 2(veth260) entered blocking state
Jul 18 14:28:40 MASTER_SERVER kernel: br0: port 2(veth260) entered disabled state
Jul 18 14:28:40 MASTER_SERVER kernel: device veth260 entered promiscuous mode
Jul 18 14:28:41 MASTER_SERVER NetworkManager[8489]: <info>  [1563452921.0567] manager: (veth9b724a9): new Veth device (/org/freedesktop/NetworkManager/Devices/1921)
Jul 18 14:28:41 MASTER_SERVER NetworkManager[8489]: <info>  [1563452921.0604] manager: (veth328f5ab): new Veth device (/org/freedesktop/NetworkManager/Devices/1922)
Jul 18 14:28:41 MASTER_SERVER kernel: br0: port 4(veth83) entered blocking state
Jul 18 14:28:41 MASTER_SERVER kernel: br0: port 4(veth83) entered disabled state
Jul 18 14:28:41 MASTER_SERVER kernel: device veth83 entered promiscuous mode
Jul 18 14:28:41 MASTER_SERVER kernel: br0: port 4(veth83) entered blocking state
Jul 18 14:28:41 MASTER_SERVER kernel: br0: port 4(veth83) entered forwarding state
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41+02:00" level=info msg="shim reaped" id=af9c57b5a3e15cf84eb6c11e81b9b1283c86d289819600ddd83fd5cf791141cf module="containerd/tasks"
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41+02:00" level=info msg="shim reaped" id=08e21510d82e56462f4bcfe5e4d734c8fcf21c92470afcf711bca5c7b1a29e88 module="containerd/tasks"
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41.126973723+02:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41.127491897+02:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41.128341866+02:00" level=warning msg="rmServiceBinding 208e26beb7581a7e09041ebf23d6170f0f7ad7785a28b0d5b37b4194cca8d64e possible transient state ok:false entries:0 set:false "
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41+02:00" level=info msg="shim reaped" id=99889097e3444a485567d85d5dcdbd1f56c49369207f3137d7c80d7c8efd1168 module="containerd/tasks"
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41.132298072+02:00" level=warning msg="rmServiceBinding 055c400762be47b9e7c09c3a948c17b049a91b0cb53fbcefaee5f48f448fb91a possible transient state ok:false entries:0 set:false "
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41.140787013+02:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 18 14:28:41 MASTER_SERVER NetworkManager[8489]: <info>  [1563452921.1487] manager: (veth7b97007): new Veth device (/org/freedesktop/NetworkManager/Devices/1923)
Jul 18 14:28:41 MASTER_SERVER kernel: br0: port 4(veth83) entered disabled state
Jul 18 14:28:41 MASTER_SERVER kernel: br0: port 16(veth257) entered disabled state
...
MORE OF THE SAME
...
Jul 18 14:28:54 MASTER_SERVER kernel: br0: port 7(veth82) entered disabled state
Jul 18 14:28:54 MASTER_SERVER kernel: device veth82 left promiscuous mode
Jul 18 14:28:54 MASTER_SERVER kernel: br0: port 7(veth82) entered disabled state
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: docker_gwbridge: port 17(veth7648a95) entered disabled state
Jul 18 14:28:54 MASTER_SERVER kernel: device veth7648a95 left promiscuous mode
Jul 18 14:28:54 MASTER_SERVER kernel: docker_gwbridge: port 17(veth7648a95) entered disabled state
Jul 18 14:28:54 MASTER_SERVER NetworkManager[8489]: <info>  [1563452934.9971] device (veth7648a95): released from master device docker_gwbridge
Jul 18 14:28:55 MASTER_SERVER kernel: br0: port 15(veth256) entered disabled state
Jul 18 14:28:55 MASTER_SERVER kernel: device veth256 left promiscuous mode
Jul 18 14:28:55 MASTER_SERVER kernel: br0: port 15(veth256) entered disabled state
Jul 18 14:28:55 MASTER_SERVER kernel: docker_gwbridge: port 2(veth865a1b5) entered blocking state
Jul 18 14:28:55 MASTER_SERVER kernel: docker_gwbridge: port 2(veth865a1b5) entered disabled state
Jul 18 14:28:55 MASTER_SERVER kernel: device veth865a1b5 entered promiscuous mode
Jul 18 14:28:55 MASTER_SERVER kernel: br0: port 11(veth253) entered disabled state
Jul 18 14:28:55 MASTER_SERVER dockerd: time="2019-07-18T14:28:55+02:00" level=error msg="setting up rule failed, [-t mangle -D OUTPUT -d 10.0.0.155/32 -j MARK --set-mark 458]:  (iptables failed: iptables --wait -t mangle -D OUTPUT -d 10.0.0.155/32 -j MARK --set-mark 458: iptables: No chain/target/match by that name.\n (exit status 1))"
Jul 18 14:28:55 MASTER_SERVER kernel: device veth253 left promiscuous mode
Jul 18 14:28:55 MASTER_SERVER kernel: br0: port 11(veth253) entered disabled state
Jul 18 14:28:55 MASTER_SERVER dockerd: time="2019-07-18T14:28:55.049845009+02:00" level=error msg="Failed to delete firewall mark rule in sbox e6c95a7 (16cba4b): reexec failed: exit status 5"
... 
MORE OF THIS
...
Jul 18 14:28:57 MASTER_SERVER dockerd: time="2019-07-18T14:28:57.877371310+02:00" level=warning msg="rmServiceBinding a22765c0f00db0a4949b009129fa3efc78e2450567e759ff306dcfd32ebe0043 possible transient state ok:false entries:0 set:false "
Jul 18 14:28:57 MASTER_SERVER dockerd: time="2019-07-18T14:28:57.893365848+02:00" level=error msg="Failed to delete real server 10.0.0.124 for vip 10.0.0.161 fwmark 455 in sbox f9cd880 (f0562de): no such process"
Jul 18 14:28:57 MASTER_SERVER dockerd: time="2019-07-18T14:28:57.893415513+02:00" level=error msg="Failed to delete service for vip 10.0.0.161 fwmark 455 in sbox f9cd880 (f0562de): no such process"
Jul 18 14:28:57 MASTER_SERVER NetworkManager[8489]: <info>  [1563452937.9029] device (veth8339f31): carrier: link connected
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 3(veth8339f31) entered blocking state
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 3(veth8339f31) entered forwarding state
Jul 18 14:28:57 MASTER_SERVER NetworkManager[8489]: <info>  [1563452937.9250] device (veth242a19a): carrier: link connected
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 5(veth242a19a) entered blocking state
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 5(veth242a19a) entered forwarding state
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 6(veth08b162c) entered blocking state
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 6(veth08b162c) entered forwarding state
Jul 18 14:28:57 MASTER_SERVER NetworkManager[8489]: <info>  [1563452937.9454] device (veth08b162c): carrier: link connected
Jul 18 14:28:57 MASTER_SERVER NetworkManager[8489]: <info>  [1563452937.9647] device (veth1c69a54): carrier: link connected
Jul 18 14:28:57 MASTER_SERVER dockerd: time="2019-07-18T14:28:57+02:00" level=error msg="setting up rule failed, [-t mangle -D OUTPUT -d 10.0.0.161/32 -j MARK --set-mark 455]:  (iptables failed: iptables --wait -t mangle -D OUTPUT -d 10.0.0.161/32 -j MARK --set-mark 455: iptables: No chain/target/match by that name.\n (exit status 1))"
Jul 18 14:28:57 MASTER_SERVER dockerd: time="2019-07-18T14:28:57.975416212+02:00" level=error msg="Failed to delete firewall mark rule in sbox f9cd880 (f0562de): reexec failed: exit status 5"
Jul 18 14:28:57 MASTER_SERVER dockerd: time="2019-07-18T14:28:57.975578597+02:00" level=error msg="Failed to delete real server 10.0.0.124 for vip 10.0.0.161 fwmark 455 in sbox fd62ef8 (7801a00): no such process"
Jul 18 14:28:57 MASTER_SERVER dockerd: time="2019-07-18T14:28:57.975609232+02:00" level=error msg="Failed to delete service for vip 10.0.0.161 fwmark 455 in sbox fd62ef8 (7801a00): no such process"
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 14(veth1c69a54) entered blocking state
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 14(veth1c69a54) entered forwarding state
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 2(veth865a1b5) entered blocking state
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 2(veth865a1b5) entered forwarding state
Jul 18 14:28:57 MASTER_SERVER NetworkManager[8489]: <info>  [1563452937.9814] device (veth865a1b5): carrier: link connected
Jul 18 14:28:57 MASTER_SERVER NetworkManager[8489]: <info>  [1563452937.9902] manager: (veth90d68ef): new Veth device (/org/freedesktop/NetworkManager/Devices/1988)
Jul 18 14:28:57 MASTER_SERVER NetworkManager[8489]: <info>  [1563452937.9937] manager: (veth1c52f1f): new Veth device (/org/freedesktop/NetworkManager/Devices/1989)
Jul 18 14:28:58 MASTER_SERVER kernel: docker_gwbridge: port 9(veth713fa85) entered blocking state
Jul 18 14:28:58 MASTER_SERVER kernel: docker_gwbridge: port 9(veth713fa85) entered disabled state
Jul 18 14:28:58 MASTER_SERVER kernel: device veth713fa85 entered promiscuous mode
Jul 18 14:28:58 MASTER_SERVER kernel: docker_gwbridge: port 9(veth713fa85) entered blocking state
Jul 18 14:28:58 MASTER_SERVER kernel: docker_gwbridge: port 9(veth713fa85) entered forwarding state
Jul 18 14:28:58 MASTER_SERVER NetworkManager[8489]: <info>  [1563452938.0130] manager: (vethdc96c98): new Veth device (/org/freedesktop/NetworkManager/Devices/1990)
Jul 18 14:28:58 MASTER_SERVER NetworkManager[8489]: <info>  [1563452938.0142] manager: (veth713fa85): new Veth device (/org/freedesktop/NetworkManager/Devices/1991)
Jul 18 14:28:58 MASTER_SERVER kernel: br0: port 4(veth269) entered blocking state
Jul 18 14:28:58 MASTER_SERVER kernel: br0: port 4(veth269) entered disabled state
Jul 18 14:28:58 MASTER_SERVER kernel: device veth269 entered promiscuous mode
Jul 18 14:28:58 MASTER_SERVER kernel: br0: port 4(veth269) entered blocking state
Jul 18 14:28:58 MASTER_SERVER kernel: br0: port 4(veth269) entered forwarding state
Jul 18 14:28:58 MASTER_SERVER dockerd: time="2019-07-18T14:28:58+02:00" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/5cadb3994f4d81174c1b686028a6fd3e917c9b8b2ac62b17c4754e59ed4a2158/shim.sock" debug=false module="containerd/tasks" pid=2290
Jul 18 14:28:58 MASTER_SERVER dockerd: time="2019-07-18T14:28:58+02:00" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/536c3eae39192936e649668192e0daf6efba70cc60d1822cd0e8e2e593ebac40/shim.sock" debug=false module="containerd/tasks" pid=2300
Jul 18 14:28:58 MASTER_SERVER kernel: IPVS: Creating netns size=2048 id=305
Jul 18 14:28:58 MASTER_SERVER dockerd: time="2019-07-18T14:28:58+02:00" level=error msg="setting up rule failed, [-t mangle -D OUTPUT -d 10.0.0.161/32 -j MARK --set-mark 455]:  (iptables failed: iptables --wait -t mangle -D OUTPUT -d 10.0.0.161/32 -j MARK --set-mark 455: iptables: No chain/target/match by that name.\n (exit status 1))"
Jul 18 14:28:58 MASTER_SERVER dockerd: time="2019-07-18T14:28:58.283086156+02:00" level=error msg="Failed to delete firewall mark rule in sbox fd62ef8 (7801a00): reexec failed: exit status 5"
Jul 18 14:28:58 MASTER_SERVER dockerd: time="2019-07-18T14:28:58.283246626+02:00" level=error msg="Failed to delete real server 10.0.0.124 for vip 10.0.0.161 fwmark 455 in sbox b828388 (4152113): no such process"
Jul 18 14:28:58 MASTER_SERVER dockerd: time="2019-07-18T14:28:58.283277874+02:00" level=error msg="Failed to delete service for vip 10.0.0.161 fwmark 455 in sbox b828388 (4152113): no such process"
Jul 18 14:28:58 MASTER_SERVER kernel: IPVS: Creating netns size=2048 id=306

当我使用docker service ps qbrek329yydx检查服务时 我看到以前的容器具有所需的状态“关闭”。当我检查停止的容器时,它们要么以状态0或状态137退出

状态0:

    "State": {
        "Status": "exited",
        "Running": false,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 0,
        "ExitCode": 0,
        "Error": "",
        "StartedAt": "2019-07-18T12:29:03.540071095Z",
        "FinishedAt": "2019-07-18T14:32:57.229648337Z"
    }

状态137:

    "State": {
        "Status": "exited",
        "Running": false,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 0,
        "ExitCode": 137,
        "Error": "",
        "StartedAt": "2019-07-18T12:29:00.047740568Z",
        "FinishedAt": "2019-07-18T14:33:07.497580769Z",
        "Health": {
            "Status": "unhealthy",
            "FailingStreak": 0,

状态为137的人是正常的,因为如果RabbitMQ停止运行,Spring将报告其运行状况不佳,并且Docker将尝试重新启动。容器没有被OOM杀死,内部的应用程序没有崩溃,Docker只是决定重新启动一切。

我们还部署了没有群集的容器,这些容器从不重启,并且长期以来仍在运行。只有那些使用docker服务创建的容器才会重新启动。这也表明docker守护程序也没有重新启动。

我们首先假设这是一个网络问题(导致RabbitMQ失败并且我们的运行状况检查导致重新启动,但是即使是没有运行状况检查的filebeat容器也重新启动了),与此同时,我们应用了SACK Kernel紧急攻击缓解措施,并且these fixes,但没有运气。

我们仍然有足够的磁盘空间,所以这也不是问题。我们的内存不足,但是没有什么会导致崩溃的

              total        used        free      shared  buff/cache   available
Mem:            15G         11G        1.3G        183M        2.5G        3.2G
Swap:          4.0G        258M        3.7G

docker info的输出是

Containers: 54
 Running: 14
 Paused: 0
 Stopped: 40
Images: 48
Server Version: 18.03.1-ce
Storage Driver: overlay
 Backing Filesystem: xfs
 Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: 0bf0j7rcw6xg7pda5lka7jyy8
 Is Manager: true
 ClusterID: 0e4yg9p17skhonx5jkaulhrsi
 Managers: 1
 Nodes: 2
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.242.171.24
 Manager Addresses:
  10.242.171.24:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-957.12.2.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.6 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 15.5GiB
Name: OUR_MASTER_SERVER
ID: OZOG:AM6N:XGRS:KZMV:BOHR:QGAX:625X:AFDW:F6AV:P7ZF:FS4R:TCZL
Docker Root Dir: /opt/hpc/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: bridge-nf-call-ip6tables is disabled

有人知道我们为什么要重新启动吗?

3 个答案:

答案 0 :(得分:0)

由于heartbeat to manager { } failed,我认为问题与您的网络连接性有关。

 Managers: 1
 Nodes: 2

如果确实如此,那么您只有一个管理者节点和两个工作者节点?
重新连接到管理器节点以满满服务定义后,管理器节点将重新启动(我想也是)所有容器。

对于生产性工作负载,您至少应具有三个docker swarm管理器。这将使群集更具容错能力。 当您的集群仅包含三个节点时,请将两个辅助节点也提升为管理者。

答案 1 :(得分:0)

因此,我做了很多挖掘工作,将生产节点置于调试模式并检查了日志。 在调试模式下,Docker每隔1-2秒输出一条日志消息,我注意到日志中存在多个间隔,其中一些间隔为5秒长,但有时20秒钟以上我都不会收到任何日志消息,之后Docker会考虑主节点(本身)不稳定,它迫使自身重新启动,从而拉动其他所有实例(拥有一个主节点的缺点)。

将我们的VM迁移到另一台服务器后,该问题再也没有发生,并且再次正常运行。我们已经订购了第3个VM,因此我们可以与3个主服务器一起运行,以便我们能够在单个主服务器故障中幸免。

答案 2 :(得分:0)

对我来说是我在运行 Docker Swarm 的虚拟机中的备份程序。备份例程将服务切断了几秒钟,屏幕中弹出的一堆日志 + 连接到 VM 中运行的此服务的服务已断开连接。