I'm running Nagios 3.2.3 and have a mysterious problem with host checks. Here is a sample host definition:
define host {
    host_name                     HOST
    contacts                      CONTACTS_HERE
    alias                         ALIAS
    max_check_attempts            15
    check_interval                5
    active_checks_enabled         1
    passive_checks_enabled        1
    check_period                  24x7
    obsess_over_host              0
    retry_interval                1
    check_freshness               0
    freshness_threshold           120
    retain_status_information     1
    retain_nonstatus_information  1
    low_flap_threshold            0
    high_flap_threshold           0
    flap_detection_enabled        0
    process_perf_data             1
    notification_interval         120
    notification_period           24x7
    notification_options          d,u,r
    check_command                 check-host-alive
    icon_image_alt                Linux
    icon_image                    linux40.png
    statusmap_image               linux40.gd2
}
As you can see, max_check_attempts is set to 15 and retry_interval is set to 1 minute. The check command looks like this:
define command {
    command_name  check-host-alive
    command_line  /usr/lib64/nagios/plugins/check_ping -H $HOSTNAME$ -w 3000.0,80% -c 5000.0,100% -p 1
}
However, this is the sequence of events that actually occurs (newest entry first):
[01-30-2017 21:41:56] HOST ALERT: HOST_NAME;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.17 ms
[01-30-2017 21:41:21] HOST ALERT: HOST_NAME;DOWN;HARD;1;PING CRITICAL - Packet loss = 100%
[01-30-2017 21:41:10] HOST ALERT: HOST_NAME;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
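So after the first failed check, the host goes into a HARD state instead of being rechecked 15 times at 1-minute intervals. I should add that this seems to happen when the host is not really down but is very busy.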
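To make the expectation concrete, here is a minimal Python sketch of the state sequence I would expect from consecutive failed checks. It is a simplified model of my reading of max_check_attempts, not Nagios code; the function name expected_down_sequence and its parameters are made up for illustration.

```python
def expected_down_sequence(max_check_attempts, failures):
    """Simplified model (an assumption, not Nagios source code):
    each consecutive DOWN result stays SOFT until the attempt
    counter reaches max_check_attempts, then becomes HARD."""
    sequence = []
    for attempt in range(1, failures + 1):
        # The attempt counter stops growing once the limit is reached.
        shown_attempt = min(attempt, max_check_attempts)
        state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
        sequence.append(("DOWN", state_type, shown_attempt))
    return sequence


if __name__ == "__main__":
    # Two failed checks with max_check_attempts = 15, as in my host definition.
    for state, state_type, attempt in expected_down_sequence(15, 2):
        print(f"{state};{state_type};{attempt}")
```

Under this model, two failures in a row would be logged as DOWN;SOFT;1 followed by DOWN;SOFT;2, whereas my log shows DOWN;SOFT;1 followed by DOWN;HARD;1 only 11 seconds later.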
Any ideas?
Thanks, Sergey