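I have a custom systemd service that uses inotify to watch the filesystem and creates files when certain events occur.
The service runs fine for many days, sometimes even weeks, and then suddenly stops. It is configured with Restart=always, so I expected it to recover automatically after a failure, but that does not happen.
I would like to know how to find out why the service does not recover on its own, and how to fix it.
Here is the service configuration: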
[Unit]
Description=Sets a PID limit (pids.max) for each container in the docker host
After=docker.service
Wants=docker.service
[Service]
Type=simple
Restart=always
StartLimitInterval=0
RestartSec=5
ExecStart=/opt/scripts/container-pid-limit.sh
StandardError=journal
Contents of /opt/scripts/container-pid-limit.sh:
#!/bin/bash -x
MAX_PIDS=5000
CGROUPS_DIR=/sys/fs/cgroup/pids/docker/
CONTAINERS_DIR=/srv/docker_root/containers/

set_limit() {
    limit=$(grep -ir label $CONTAINERS_DIR/$1/config.v2.json | jq -r '.Config.Labels["com.xyz.pid_limit"]')
    if [[ ! $limit -gt 0 ]] ; then
        limit=$MAX_PIDS
    fi
    echo "CONTAINER: $c LIMIT $limit FILE $f"
    echo $limit > $f;
}

# set pids.max for already created containers
for f in $(find $CGROUPS_DIR -mindepth 2 -name pids.max); do
    c=$(dirname $f | xargs basename)
    set_limit $c
done

# monitor cgroup dir for newly created dirs
inotifywait --event create,isdir --monitor --quiet --format "%w%f" $CGROUPS_DIR | while read -r line; do
    c=$(basename $line)
    set_limit $c
done
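A side note on the script above: set_limit writes to the global $f, which is only assigned by the initial for loop. Inside the inotifywait loop it still holds whatever path that loop set last (or nothing at all if no container existed at startup), so the limit for a newly created container does not go to that container's own pids.max. A minimal sketch of a variant that passes the target file explicitly could look like the following. The names DEFAULT_MAX_PIDS, resolve_limit and apply_limit are illustrative and not part of the original script, and reading the label is left to the caller.

```shell
#!/bin/bash
# Sketch only: DEFAULT_MAX_PIDS, resolve_limit and apply_limit are
# illustrative names, not part of the original script.
DEFAULT_MAX_PIDS=5000

# Print the raw label value if it is a positive integer without leading
# zeros; otherwise print the default (covers "", "null" and other junk).
resolve_limit() {
    local raw=$1
    if [[ $raw =~ ^[1-9][0-9]*$ ]]; then
        echo "$raw"
    else
        echo "$DEFAULT_MAX_PIDS"
    fi
}

# Write a limit to an explicitly named pids.max file.
#   $1 = container id
#   $2 = path to that container's pids.max
#   $3 = raw label value as read by the caller (may be empty or "null")
apply_limit() {
    local container=$1 pids_file=$2 limit
    limit=$(resolve_limit "$3")
    echo "CONTAINER: $container LIMIT $limit FILE $pids_file"
    echo "$limit" > "$pids_file"
}
```

In the inotifywait loop, where $line is the path of the newly created cgroup directory, a call might then read apply_limit "$c" "$line/pids.max" "$label", with $label being whatever the existing grep and jq pipeline returned for that container.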
Sample output of systemctl status before the failure:
● container-pid-limit.service - Sets a PID limit (pids.max) for each container in the docker host
   Loaded: loaded (/etc/systemd/system/container-pid-limit.service; static; vendor preset: enabled)
   Active: active (running) since Wed 2019-06-05 08:44:38 UTC; 14min ago
 Main PID: 277527 (container-pid-l)
    Tasks: 3
   Memory: 2.3M
      CPU: 79ms
   CGroup: /system.slice/container-pid-limit.service
           ├─277527 /bin/bash /opt/scripts/container-pid-limit.sh
           ├─277892 inotifywait --event create,isdir --monitor --quiet --format %w%f /sys/fs/cgroup/pids/docker/
           └─277893 /bin/bash /opt/scripts/container-pid-limit.sh
Sample output of systemctl status after the failure:
● container-pid-limit.service - Sets a PID limit (pids.max) for each container in the docker host
   Loaded: loaded (/etc/systemd/system/container-pid-limit.service; static; vendor preset: enabled)
   Active: inactive (dead)
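Edit: I have been trying to use systemctl status and systemctl show to determine when the service started and when it finally stopped, but it looks as though all history is lost once the service fails.
Sample output of systemctl show: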
Type=simple
Restart=always
NotifyAccess=none
RestartUSec=5s
TimeoutStartUSec=1min
TimeoutStopUSec=45s
RuntimeMaxUSec=infinity
WatchdogUSec=0
WatchdogTimestampMonotonic=0
FailureAction=none
PermissionsStartOnly=no
RootDirectoryStartOnly=no
RemainAfterExit=no
GuessMainPID=yes
MainPID=0
ControlPID=0
FileDescriptorStoreMax=0
NFileDescriptorStore=0
StatusErrno=0
Result=success
ExecMainStartTimestampMonotonic=0
ExecMainExitTimestampMonotonic=0
ExecMainPID=0
ExecMainCode=0
ExecMainStatus=0
ExecStart={ path=/opt/scripts/container-pid-limit.sh ; argv[]=/opt/scripts//container-pid-limit.sh ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }
Slice=system.slice
MemoryCurrent=18446744073709551615
CPUUsageNSec=18446744073709551615
TasksCurrent=18446744073709551615
Delegate=no
CPUAccounting=no
CPUShares=18446744073709551615
StartupCPUShares=18446744073709551615
CPUQuotaPerSecUSec=infinity
BlockIOAccounting=no
BlockIOWeight=18446744073709551615
StartupBlockIOWeight=18446744073709551615
MemoryAccounting=no
MemoryLimit=18446744073709551615
DevicePolicy=auto
TasksAccounting=no
TasksMax=18446744073709551615
UMask=0022
LimitCPU=18446744073709551615
LimitCPUSoft=18446744073709551615
LimitFSIZE=18446744073709551615
LimitFSIZESoft=18446744073709551615
LimitDATA=18446744073709551615
LimitDATASoft=18446744073709551615
LimitSTACK=18446744073709551615
LimitSTACKSoft=8388608
LimitCORE=18446744073709551615
LimitCORESoft=0
LimitRSS=18446744073709551615
LimitRSSSoft=18446744073709551615
LimitNOFILE=4096
LimitNOFILESoft=1024
LimitAS=18446744073709551615
LimitASSoft=18446744073709551615
LimitNPROC=7869937
LimitNPROCSoft=7869937
LimitMEMLOCK=65536
LimitMEMLOCKSoft=65536
LimitLOCKS=18446744073709551615
LimitLOCKSSoft=18446744073709551615
LimitSIGPENDING=7869937
LimitSIGPENDINGSoft=7869937
LimitMSGQUEUE=819200
LimitMSGQUEUESoft=819200
LimitNICE=0
LimitNICESoft=0
LimitRTPRIO=0
LimitRTPRIOSoft=0
LimitRTTIME=18446744073709551615
LimitRTTIMESoft=18446744073709551615
OOMScoreAdjust=0
Nice=0
IOScheduling=0
CPUSchedulingPolicy=0
CPUSchedulingPriority=0
TimerSlackNSec=50000
CPUSchedulingResetOnFork=no
NonBlocking=no
StandardInput=null
StandardOutput=journal
StandardError=journal
TTYReset=no
TTYVHangup=no
TTYVTDisallocate=no
SyslogPriority=30
SyslogLevelPrefix=yes
SyslogLevel=6
SyslogFacility=3
SecureBits=0
CapabilityBoundingSet=18446744073709551615
AmbientCapabilities=0
MountFlags=0
PrivateTmp=no
PrivateNetwork=no
PrivateDevices=no
ProtectHome=no
ProtectSystem=no
SameProcessGroup=no
UtmpMode=init
IgnoreSIGPIPE=yes
NoNewPrivileges=no
SystemCallErrorNumber=0
RuntimeDirectoryMode=0755
KillMode=control-group
KillSignal=15
SendSIGKILL=yes
SendSIGHUP=no
Id=container-pid-limit.service
Names=container-pid-limit.service
Requires=sysinit.target system.slice
Wants=docker.service
Conflicts=shutdown.target
Before=shutdown.target
After=basic.target systemd-journald.socket system.slice docker.service sysinit.target
Description=Sets a PID limit (pids.max) for each container in the docker host
LoadState=loaded
ActiveState=inactive
SubState=dead
FragmentPath=/etc/systemd/system/container-pid-limit.service
UnitFileState=static
UnitFilePreset=enabled
StateChangeTimestampMonotonic=0
InactiveExitTimestampMonotonic=0
ActiveEnterTimestampMonotonic=0
ActiveExitTimestampMonotonic=0
InactiveEnterTimestampMonotonic=0
CanStart=yes
CanStop=yes
CanReload=no
CanIsolate=no
StopWhenUnneeded=no
RefuseManualStart=no
RefuseManualStop=no
AllowIsolate=no
DefaultDependencies=yes
OnFailureJobMode=replace
IgnoreOnIsolate=no
NeedDaemonReload=no
JobTimeoutUSec=infinity
JobTimeoutAction=none
ConditionResult=no
AssertResult=no
ConditionTimestampMonotonic=0
AssertTimestampMonotonic=0
Transient=no
StartLimitInterval=0
StartLimitBurst=5
StartLimitAction=none
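Answer 0 (score: 0)
Restart=always in systemd does not mean the service is restarted in an endless loop. It is about failure handling.
Read the manual on StartLimitBurst.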
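To narrow down what happened, it can help to look at a few fields of the systemctl show output from the question. There, every timestamp is 0, Result is success, and UnitFileState is static. A static unit has no [Install] section, so it cannot be enabled and only runs when it is started manually or pulled in by another unit, while Restart= only governs what happens after an already started service exits. One reading consistent with that output, offered as an assumption rather than something the output proves, is that systemd holds no record of the unit having been active, for example because the host rebooted and nothing started the unit again. The sketch below summarizes those fields. The function name summarize_unit_state is hypothetical and not a systemd tool.

```shell
#!/bin/bash
# Hypothetical helper (not part of systemd): reads KEY=VALUE lines in the
# format printed by `systemctl show <unit>` on stdin and prints a one-line
# summary of three fields.
summarize_unit_state() {
    local key value
    local active_enter="unknown" result="unknown" file_state="unknown"
    # Split each line at the first '='; the rest of the line goes to $value.
    while IFS='=' read -r key value; do
        case "$key" in
            ActiveEnterTimestampMonotonic) active_enter=$value ;;
            Result)                        result=$value ;;
            UnitFileState)                 file_state=$value ;;
        esac
    done
    if [[ $active_enter == unknown ]]; then
        echo "timestamp-missing result=$result unit-file=$file_state"
    elif [[ $active_enter == 0 ]]; then
        # Assumption: a zero timestamp is read as "no activation on record".
        echo "no-activation-record result=$result unit-file=$file_state"
    else
        echo "activated result=$result unit-file=$file_state"
    fi
}
```

It would be used as systemctl show container-pid-limit.service | summarize_unit_state. For the history that systemctl show does not keep, journalctl -u container-pid-limit.service prints the unit's log entries, and journalctl --list-boots lists the boots the journal knows about, provided the journal is stored persistently on that host.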