Does dockerd support WatchdogSec sd_notify health checks?

Time: 2019-08-08 00:05:58

Tags: docker systemd hang watchdog

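We keep running into an issue where the Docker daemon occasionally stops responding on one of our Kubernetes systems, yet systemd still considers the service to be running: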

systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2019-04-15 20:40:57 UTC; 3 months 22 days ago
     Docs: https://docs.docker.com
 Main PID: 1281 (dockerd)
    Tasks: 1409
   Memory: 31.0G
      CPU: 5d 17h 3min 4.758s
   CGroup: /system.slice/docker.service
           ├─ 1281 /usr/bin/dockerd -H fd://
...

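Nothing in journalctl -u docker or the syslog file indicates what the problem is, but the Docker daemon no longer responds to requests (docker ps hangs). We are currently using the 17.03.2~ce-0~ubuntu-xenial package for Ubuntu 16.04, which ships the following service unit: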

cat /lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network.target docker.socket firewalld.service
Requires=docker.socket

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H fd://
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process

[Install]
WantedBy=multi-user.target

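I noticed that even though it is a Type=notify service, no WatchdogSec= is defined in the service unit.

Does the Docker daemon support setting a watchdog timeout for sd_notify-based health checks?

1 answer:

Answer 0 (score: 0)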

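No. Currently, the components/engine/cmd/dockerd/daemon_linux.go file only implements systemdDaemon.SdNotifyReady to notify systemd when the process starts. For watchdog support, it would have to use something like SdWatchdogEnabled to continuously send SdNotifyWatchdog = "WATCHDOG=1" notifications.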

If you try setting WatchdogSec=60s in the docker.service file, systemd will kill and restart the service because the daemon does not send the required notifications:

systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-08-08 02:09:52 UTC; 50s ago

systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: deactivating (stop-sigabrt) (Result: watchdog) since Thu 2019-08-08 02:10:02 UTC; 45ms ago

systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: activating (start) since Thu 2019-08-08 02:10:04 UTC; 777ms ago

# Log entries:
Aug 08 02:09:14 kam1 systemd[1]: Starting Docker Application Container Engine...
Aug 08 02:09:15 kam1 systemd[1]: Started Docker Application Container Engine.
Aug 08 02:10:15 kam1 systemd[1]: docker.service: Watchdog timeout (limit 60s)!
Aug 08 02:10:15 kam1 systemd[1]: docker.service: Killing process 12383 (dockerd) with signal SIGABRT.
Aug 08 02:10:16 kam1 systemd[1]: docker.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 08 02:10:16 kam1 systemd[1]: docker.service: Failed with result 'watchdog'.
Aug 08 02:10:18 kam1 systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Aug 08 02:10:18 kam1 systemd[1]: docker.service: Scheduled restart job, restart counter is at 3.
Aug 08 02:10:18 kam1 systemd[1]: Stopped Docker Application Container Engine.
Aug 08 02:10:18 kam1 systemd[1]: Starting Docker Application Container Engine...