Question

I am running Nagios 3.2.3 and have a puzzling problem with host checks. Here is a sample host definition.

define host {
  host_name HOST
  contacts CONTACTS_HERE
  alias ALIAS
  max_check_attempts 15
  check_interval 5
  active_checks_enabled 1
  passive_checks_enabled 1
  check_period 24x7
  obsess_over_host 0
  retry_interval 1
  check_freshness 0
  freshness_threshold 120
  retain_status_information 1
  retain_nonstatus_information 1
  low_flap_threshold 0
  high_flap_threshold 0
  flap_detection_enabled 0
  process_perf_data 1
  notification_interval 120
  notification_period 24x7
  notification_options d,u,r
  check_command check-host-alive
  icon_image_alt Linux
  icon_image linux40.png
  statusmap_image linux40.gd2
}

As you can see, max_check_attempts is set to 15 and retry_interval is set to 1 minute. The check command looks like this:

define command {
  command_name check-host-alive
  command_line /usr/lib64/nagios/plugins/check_ping -H $HOSTNAME$ -w 3000.0,80% -c 5000.0,100% -p 1
}

However, this is the sequence of events that actually occurs (newest first):

Host Up [01-30-2017 21:41:56] HOST ALERT: HOST_NAME;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.17 ms
Host Down [01-30-2017 21:41:21] HOST ALERT: HOST_NAME;DOWN;HARD;1;PING CRITICAL - Packet loss = 100%
Host Down [01-30-2017 21:41:10] HOST ALERT: HOST_NAME;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
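
To make the timing in these entries concrete, here is a minimal Python sketch that parses the three HOST ALERT lines quoted above and prints the gap between consecutive events. The helper name parse_alert is hypothetical, and the timestamp format is assumed to be month-day-year as it appears in the log.

```python
from datetime import datetime

# The three HOST ALERT entries quoted above, reordered oldest first.
LOG_LINES = [
    "[01-30-2017 21:41:10] HOST ALERT: HOST_NAME;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%",
    "[01-30-2017 21:41:21] HOST ALERT: HOST_NAME;DOWN;HARD;1;PING CRITICAL - Packet loss = 100%",
    "[01-30-2017 21:41:56] HOST ALERT: HOST_NAME;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.17 ms",
]


def parse_alert(line):
    """Hypothetical helper: split one HOST ALERT line into its fields."""
    stamp, rest = line[1:].split("] HOST ALERT: ", 1)
    _host, state, state_type, attempt, _output = rest.split(";", 4)
    # Assumption: timestamps are month-day-year, as shown in the log.
    when = datetime.strptime(stamp, "%m-%d-%Y %H:%M:%S")
    return when, state, state_type, int(attempt)


events = [parse_alert(line) for line in LOG_LINES]

# Seconds elapsed between each event and the next one.
gaps = [
    int((later[0] - earlier[0]).total_seconds())
    for earlier, later in zip(events, events[1:])
]

for (when, state, state_type, attempt), gap in zip(events[1:], gaps):
    print(f"{state};{state_type};attempt {attempt} arrived {gap}s after the previous event")
```

Run against these entries, the sketch shows the DOWN;HARD event following the DOWN;SOFT event by seconds rather than minutes, with the attempt counter still at 1.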

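So after the first failed check, the host goes into a HARD state instead of being checked 15 times at 1-minute intervals. I should add that this seems to happen when the host is not really down but is very busy.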
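
For comparison, here is a rough Python sketch of how long the SOFT to HARD transition would be expected to take with the settings in the host definition. It assumes interval_length is left at 60 seconds in nagios.cfg (not shown in the question), and that the first failure counts as attempt 1 so that 14 retries follow. The function name is hypothetical.

```python
# Values taken from the host definition in the question.
MAX_CHECK_ATTEMPTS = 15
RETRY_INTERVAL = 1  # in units of interval_length

# Assumption: interval_length in nagios.cfg is 60 seconds; the question does not show it.
INTERVAL_LENGTH_SECONDS = 60


def expected_soft_to_hard_seconds(max_attempts, retry_interval, interval_length):
    """Hypothetical helper: time from the first failed check to the HARD state.

    The first failure is attempt 1, so (max_attempts - 1) retries remain,
    each scheduled retry_interval * interval_length seconds apart.
    """
    retries = max_attempts - 1
    return retries * retry_interval * interval_length


expected = expected_soft_to_hard_seconds(
    MAX_CHECK_ATTEMPTS, RETRY_INTERVAL, INTERVAL_LENGTH_SECONDS
)

# Observed gap between the SOFT and HARD entries in the log (21:41:10 to 21:41:21).
observed = 11

print(f"expected about {expected}s ({expected // 60} min), observed {observed}s")
```

Under those assumptions the host should stay in a SOFT state for roughly 14 minutes, which is what makes the 11-second jump to HARD look wrong.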

Any ideas?

Thanks, Sergey

Nagios max_check_attempts / retry interval ignored with host checks

0 Answers: