Nagios: keep sending notifications until a passive check is acknowledged

Time: 2013-12-05 20:03:20

Tags: nagios

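I have a setup where Nagios receives SNMP traps from devices and then notifies the contacts defined in config.cfg. That works great. What I want is for Nagios to send another notification if the problem has not been acknowledged within a given amount of time, but I can't get Nagios to send that second notification. I'm using an external command to actually place the call as the notification, and that part works fine. I just never see Nagios attempt a second notification.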

I've cut all the config files down into a single config file to make it easier to read.

  #TIMEPERIODS


  define timeperiod{
    timeperiod_name 24x7
    alias           24 Hours A Day, 7 Days A Week
    sunday          00:00-24:00
    monday          00:00-24:00
    tuesday         00:00-24:00
    wednesday       00:00-24:00
    thursday        00:00-24:00
    friday          00:00-24:00
    saturday        00:00-24:00
    }

  #SERVICES


  ##handle the trap


  define service{
    host_name                       serverName
    service_description             TRAP
    is_volatile                     1
    check_command                   check-host-alive
    max_check_attempts              3
    normal_check_interval           1
    retry_check_interval            1
    active_checks_enabled           0
    passive_checks_enabled          1
    check_period                    24x7
    notification_interval           1
    notification_period             24x7
    notification_options            w,u,c
    notifications_enabled           1
    contact_groups                  admins
    }

  #COMMANDS

  define command{
    command_name    check-host-alive
    command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
    }

  define command{
    command_name  notify-host-by-sip
    command_line /usr/lib64/nagios/plugins/calls/makeCall "$NOTIFICATIONTYPE$"
    }

  define command{
    command_name notify-service-by-sip
    command_line /usr/lib64/nagios/plugins/calls/makeCall "$NOTIFICATIONTYPE$"
    }



  #CONTACT_GROUPS

  define contactgroup{
    contactgroup_name       admins
    alias                   Nagios Administrators
    members                 user_sip
    }

  #CONTACTS 

  define contact{
    contact_name  user_sip
    alias  useralias
    service_notification_period  24x7
    host_notification_period  24x7
    service_notification_options  w
    host_notification_options  d
    service_notification_commands notify-service-by-sip
    host_notification_commands  notify-host-by-sip
    email  someNumber@someServer
  }

  #HOSTS

  define host{
    host_name                       localhost
    alias                           Development
    address                         serverIP
    max_check_attempts              5
    check_period                    24x7
    contact_groups                  admins
    notification_period             24x7
    }

  define host{
    host_name                       serverName
    alias                           Development
    address                         someIP
    max_check_attempts              5
    check_period                    24x7
    contact_groups                  admins
    notification_period             24x7
    }

Result of the passive check:

 [1386274600] PASSIVE SERVICE CHECK: localhost;TRAP;1;TRAP trap received
 [1386274600] SERVICE ALERT: localhost;TRAP;WARNING;HARD;1;TRAP trap received
 [1386274600] SERVICE NOTIFICATION: user_sip;localhost;TRAP;WARNING;notify-service-by-sip;TRAP trap received

Then nothing happens...

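3 Answers:

Answer 0 (score: 3)

After looking through the Nagios source, I can say:

  • Notifications only happen after a check result (active or passive) has been processed. If you try to set 'notification_interval' to a value smaller than 'check_interval', Nagios warns about it at startup.

  • If 'is_volatile' is set to '1', all 'notification_interval' options in the tree are ignored. This basically means a notification is sent every time the check fails, but the check still has to fail before a notification goes out.

So unless the passive check is actively reporting non-OK results, you won't get a continuous stream of alerts.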

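The way around this is to create an "event handler" script that does the following:

  1. Check that the Nagios macro $NOTIFICATIONTYPE$ is not equal to "ACKNOWLEDGEMENT" (you must make sure acknowledgement notifications are enabled).
  2. If that condition holds, sleep 59 seconds, then resubmit the same failed passive check result to the Nagios external command file with a new epoch timestamp.
  3. This should keep the alert going until someone acknowledges it.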
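The three steps above can be sketched as a POSIX shell event handler. This is a hedged illustration, not the answerer's code: the argument order, the default command-file path in `NAGIOS_CMD_FILE`, `RESUBMIT_DELAY` and the helper names are all assumptions. The state argument is assumed to be numeric (for example from `$SERVICESTATEID$`), and whether `$NOTIFICATIONTYPE$` is actually populated for an event handler is worth checking against the macro availability table for your Nagios version.

```shell
#!/bin/sh
# Hypothetical resubmitting event handler (a sketch, not from the answer).
# Assumed invocation: handler.sh <notification type> <host> <service> <state 0-3> <output>

# Assumed location of the Nagios external command file; adjust for your install.
CMD_FILE=${NAGIOS_CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}
# Seconds to wait before re-raising the alert (the answer suggests 59).
DELAY=${RESUBMIT_DELAY:-59}

# Print one PROCESS_SERVICE_CHECK_RESULT external command line.
# Args: <epoch> <host> <service> <return code> <plugin output>
build_check_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Exit status 0 when the alert should be re-raised:
# the event is not an acknowledgement and the numeric state is non-OK.
should_resubmit() {
    [ "$1" != "ACKNOWLEDGEMENT" ] && [ "$2" -ne 0 ]
}

main() {
    ntype=$1 host=$2 svc=$3 state=$4 output=$5
    should_resubmit "$ntype" "$state" || exit 0
    sleep "$DELAY"
    # Same failed result, fresh timestamp.
    build_check_result "$(date +%s)" "$host" "$svc" "$state" "$output" >> "$CMD_FILE"
}

# Only act when called with arguments, so the functions can be sourced and tested.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

Nagios kills event handlers that run longer than `event_handler_timeout` (30 seconds by default in `nagios.cfg`), so a 59-second sleep in the foreground may never reach the resubmit step. Detaching the work, for example `main "$@" >/dev/null 2>&1 &`, is one option; redirecting the output matters because a background child that still holds the handler's stdout can keep Nagios waiting.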

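Answer 1 (score: 1)

I've created a script as you suggested, but when the clear trap comes in, there is still a sleeping event timer that resubmits the previously failed trap. Also, the Nagios event handler hangs on the sleep timer, both with and without &.
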
#!/bin/sh
# Arguments passed in by the Nagios event handler command
SERVICESTATE=$1
HOSTADDRESS=$2
SERVICEDESC=$3
SERVICEOUTPUT=$4
NOTIFICATIONTYPE=$5
# Flag file: present while the service is in a problem state
FLAG=/usr/local/libexec/nagios/pass/$HOSTADDRESS.$SERVICEDESC.bad

# POSIX sh compares strings with "=" ("==" is a bashism)
if [ "$NOTIFICATIONTYPE" = 'ACKNOWLEDGEMENT' ]
then exit
fi
case "$SERVICESTATE" in
        OK)
                rm -f "$FLAG"
                exit
                ;;
        WARNING)
                touch "$FLAG"
                sleep 45
                if [ -f "$FLAG" ]
                then
                sh /usr/local/libexec/nagios/submit_check_result "$HOSTADDRESS" "$SERVICEDESC" 1 "$SERVICEOUTPUT"
                fi
                exit
                ;;
        CRITICAL)
                touch "$FLAG"
                sleep 45
                if [ -f "$FLAG" ]
                then
                sh /usr/local/libexec/nagios/submit_check_result "$HOSTADDRESS" "$SERVICEDESC" 2 "$SERVICEOUTPUT"
                fi
                exit
                ;;
        UNKNOWN)
                exit
                ;;
esac
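If the stale resubmits come from a sleeper that wakes up after the flag file has been removed and then re-created by a newer trap, a bare existence check cannot tell the two alerts apart. One workaround, sketched here under assumptions, is to write a unique token into the flag file and let each sleeper resubmit only while the file still holds its own token. The `$$`-plus-epoch token assumes every trap runs the handler as a separate process; `FLAG_DIR` and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: keep a per-alert token in the flag file so that a sleeper started
# by an older trap does nothing once a newer trap or a recovery has replaced it.
# FLAG_DIR and all function names are hypothetical.
FLAG_DIR=${FLAG_DIR:-/usr/local/libexec/nagios/pass}

# Path of the flag file for <host> <service>.
flag_path() {
    printf '%s/%s.%s.bad\n' "$FLAG_DIR" "$1" "$2"
}

# Start a new problem for <host> <service>: write a fresh token into the
# flag file and print it so the caller can remember it.
arm_flag() {
    token="$$.$(date +%s)"
    printf '%s\n' "$token" > "$(flag_path "$1" "$2")"
    printf '%s\n' "$token"
}

# Recovery: remove the flag so every pending sleeper gives up.
clear_flag() {
    rm -f "$(flag_path "$1" "$2")"
}

# Exit status 0 only if the flag for <host> <service> still holds <token>.
flag_is_current() {
    f=$(flag_path "$1" "$2")
    [ -f "$f" ] && [ "$(cat "$f")" = "$3" ]
}

# Example use in the WARNING branch of the handler above:
#   token=$(arm_flag "$HOSTADDRESS" "$SERVICEDESC")
#   sleep 45
#   if flag_is_current "$HOSTADDRESS" "$SERVICEDESC" "$token"; then
#       sh /usr/local/libexec/nagios/submit_check_result "$HOSTADDRESS" "$SERVICEDESC" 1 "$SERVICEOUTPUT"
#   fi
```

This does not help if the delay comes from Nagios itself still waiting on the sleeping handler, so it pairs naturally with running the sleep outside the handler.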

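Answer 2 (score: 0)

I changed it around a bit: my SNMP trap handler now runs a script that touches a flag file, and as long as that file exists, the check result keeps being resent to Nagios.

I use these together with a delay on the first notification, so a quick problem-and-recovery doesn't wake me up at night, and so notifications repeat. Since Nagios still decides whether to notify, once you acknowledge the problem you stop getting emails.

Forgive my bad coding, I learned it just to solve this... :P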

pass2.sh

#!/bin/sh
# VER 2
# Called by the SNMP trap handler: pass2.sh <host> <service> <state 0-3> <output>
HOSTADDRESS=$1
SERVICEDESC=$2
SERVICESTATE=$3
SERVICEOUTPUT=$4
# Flag file: present while the service is in a problem state
FLAG=/usr/local/libexec/nagios/pass/$HOSTADDRESS.$SERVICEDESC.bad

case "$SERVICESTATE" in
        0)
                # Recovery: remove the flag (stops passtron.sh) and submit OK
                if [ -f "$FLAG" ]
                then
                rm -f "$FLAG"
                /usr/local/libexec/nagios/submit_check_result "$HOSTADDRESS" "$SERVICEDESC" 0 "$SERVICEOUTPUT"
                fi
                exit
                ;;
        1|2)
                # WARNING or CRITICAL: set the flag, submit now, start the resubmit loop
                touch "$FLAG"
                if [ -f "$FLAG" ]
                then
                /usr/local/libexec/nagios/submit_check_result "$HOSTADDRESS" "$SERVICEDESC" "$SERVICESTATE" "$SERVICEOUTPUT" &
                /usr/local/libexec/nagios/passtron.sh "$HOSTADDRESS" "$SERVICEDESC" "$SERVICESTATE" "$SERVICEOUTPUT" &
                fi
                exit
                ;;
        3)
                # UNKNOWN: ignored
                exit
                ;;
esac

passtron.sh

#!/bin/sh
# PASSTRON: while the flag file exists, resubmit the result every 60 seconds
HOSTADDRESS=$1
SERVICEDESC=$2
SERVICESTATE=$3
SERVICEOUTPUT=$4
FLAG=/usr/local/libexec/nagios/pass/$HOSTADDRESS.$SERVICEDESC.bad

case "$SERVICESTATE" in
        1|2)
                if [ -f "$FLAG" ]
                then
                sleep 60
                fi
                # Still flagged after the wait: resubmit, then schedule the next round
                if [ -f "$FLAG" ]
                then
                /usr/local/libexec/nagios/submit_check_result "$HOSTADDRESS" "$SERVICEDESC" "$SERVICESTATE" "$SERVICEOUTPUT" &
                /usr/local/libexec/nagios/passtron.sh "$HOSTADDRESS" "$SERVICEDESC" "$SERVICESTATE" "$SERVICEOUTPUT" &
                fi
                exit
                ;;
esac
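passtron.sh keeps the cycle going by re-launching itself in the background after every round. The same behaviour can be expressed as one loop, which leaves a single long-lived process per alert instead of a chain of short-lived ones. A minimal sketch; `resubmit_loop` is a hypothetical name, and the submit command is passed in as arguments rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical single-process alternative to the passtron.sh recursion.
# Args: <flag file> <interval seconds> <submit command> [args...]
# Runs the submit command every <interval> seconds while the flag file exists.
resubmit_loop() {
    flag=$1 interval=$2
    shift 2
    while [ -f "$flag" ]; do
        sleep "$interval"
        # Re-check after the wait: a recovery may have removed the flag.
        [ -f "$flag" ] || break
        "$@"
    done
}

# Example (paths from the answer), started in the background from pass2.sh:
# resubmit_loop "/usr/local/libexec/nagios/pass/$HOSTADDRESS.$SERVICEDESC.bad" 60 \
#     /usr/local/libexec/nagios/submit_check_result "$HOSTADDRESS" "$SERVICEDESC" "$SERVICESTATE" "$SERVICEOUTPUT" &
```

Because the loop re-checks the flag after every sleep, removing the flag (as the `0)` branch of pass2.sh does) stops it within one interval.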

And an example of one of my services:

define service{
 hostgroup_name         trapdevices
 use                    trap-template
 service_description    TEMP-ALARM
 contact_groups          oncall-tech
 is_volatile 0
 normal_check_interval 1
 retry_check_interval  1
 max_check_attempts    1
 active_checks_enabled 0
 notification_interval 300
 first_notification_delay 5