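I recently created an event handler for a service check that restarts Tomcat on 3 different boxes.
The check is set up as follows:
5 check attempts
Check every 2 minutes while OK
Otherwise check every 5 minutes
In the event handler script, I have: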
# What state is the iOS PN in?
case "$1" in
OK)
    # The service is ok, so don't do anything...
    ;;
WARNING)
    # Is this a "soft" or a "hard" state?
    case "$2" in
    SOFT)
        # Check number
        case "$3" in
        2)
            echo "$(date) Restarting Tomcat on Node 1 for iOS PN (2nd soft warning state)..." >> /tmp/iOSPN.log
            ;;
        3)
            echo "$(date) Restarting Tomcat on Node 2 for iOS PN (3rd soft warning state)..." >> /tmp/iOSPN.log
            ;;
        4)
            echo "$(date) Restarting Tomcat on Node 3 for iOS PN (4th soft warning state)..." >> /tmp/iOSPN.log
            ;;
        esac
        ;;
    HARD)
        # Do nothing, let Nagios send the alert
        ;;
    esac
    ;;
CRITICAL)
    # In theory nothing should reach this point...
    ;;
esac
exit 0
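So the event handler should restart Tomcat on Node 1 after the 2nd warning check, wait 5 minutes and re-check, restart Node 2 if there is still a problem, wait another 5 minutes and check again, and then restart Node 3 if it is still a problem.
However, when I check the log file, I see the following: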
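To make the expected timing concrete, here is a small sketch that prints when each check attempt should happen relative to the first failed check. It assumes one interval unit is one minute (the Nagios default `interval_length` of 60 seconds) and uses the values from the service template; `expected_offset` is a hypothetical helper name, not part of Nagios.

```shell
#!/bin/sh
# Sketch only: expected offset (in minutes) of each check attempt,
# assuming one interval unit equals one minute.
retry_interval=5   # retry_check_interval from the template
max_attempts=5     # max_check_attempts from the template

expected_offset() {
    # Attempt 1 is the first failing check (offset 0); every later
    # attempt should follow one retry interval after the previous one.
    _n=$1
    echo $(( (_n - 1) * retry_interval ))
}

attempt=1
while [ "$attempt" -le "$max_attempts" ]; do
    echo "attempt $attempt: +$(expected_offset "$attempt") min"
    attempt=$((attempt + 1))
done
```

Under these assumptions, attempts 2, 3 and 4 (the ones that trigger the restarts) would land at +5, +10 and +15 minutes.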
Thu Apr 18 15:09:13 2019 Restarting Tomcat on Node 1 for iOS PN (2nd soft warning state)...
Thu Apr 18 15:09:23 2019 Restarting Tomcat on Node 2 for iOS PN (3rd soft warning state)...
Thu Apr 18 15:09:33 2019 Restarting Tomcat on Node 3 for iOS PN (4th soft warning state)...
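As you can see, it restarts each box after 10 seconds rather than 5 minutes. I have removed the lines that actually call the Tomcat restart, since a restart cannot complete in such a short time.
I can't see anything in the Nagios logs that explains why the next check runs so quickly, so any help would be appreciated.
Additional:
Here is the service definition: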
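The gaps can be read straight off the log. A minimal sketch, assuming the default `date` output format shown above (time in field 4) and that all entries fall on the same day; `log_gaps` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch only: print the gap in seconds between consecutive log entries.
log_gaps() {
    awk '{
        split($4, t, ":")                       # HH:MM:SS is the 4th field
        now = t[1] * 3600 + t[2] * 60 + t[3]    # seconds since midnight
        if (NR > 1) print now - prev
        prev = now
    }' "$@"
}

# The three entries from the log above; prints 10 and 10.
log_gaps <<'EOF'
Thu Apr 18 15:09:13 2019 Restarting Tomcat on Node 1 for iOS PN (2nd soft warning state)...
Thu Apr 18 15:09:23 2019 Restarting Tomcat on Node 2 for iOS PN (3rd soft warning state)...
Thu Apr 18 15:09:33 2019 Restarting Tomcat on Node 3 for iOS PN (4th soft warning state)...
EOF
```

Passing a file name instead of the here-document (for example `log_gaps /tmp/iOSPN.log`) should work the same way.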
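One way to get more detail might be to record every handler invocation with an epoch timestamp and the three arguments, so the entries can be lined up against the Nagios log. This is only a sketch: the log path and the `log_invocation` name are assumptions, and the arguments are the `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` macros passed by the service definition.

```shell
#!/bin/sh
# Sketch only: debug logging for the event handler.
LOG_FILE="${LOG_FILE:-/tmp/iOSPN-debug.log}"   # assumed path, change as needed

log_invocation() {
    # $1 = service state, $2 = state type (SOFT/HARD), $3 = attempt number
    printf '%s state=%s type=%s attempt=%s\n' \
        "$(date +%s)" "$1" "$2" "$3" >> "$LOG_FILE"
}

# Called at the top of the handler, before the case statement.
log_invocation "$@"
```

Logging every call, including OK and HARD states, would show whether the handler is being run for attempts the case statement silently ignores.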
define service{
use 5check-service
host_name ACTIVEMQ1
contact_groups tyrell-admins-non-critical
service_description ActiveMQ - iOS PushNotification Queue Pending Items
event_handler restartRemote_Tomcat!$SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
check_command check_activemq_queue_item2!http://activemq1:8161/admin/xml/queues.jsp!IosPushNotificationQueue!100!300
}
define service{
name 5check-service ; The 'name' of this service template
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
is_volatile 0 ; The service is not volatile
check_period 24x7 ; The service can be checked at any time of the day
max_check_attempts 5 ; Re-check the service up to 5 times in order to determine its final (hard) state
normal_check_interval 2 ; Check the service every 2 minutes under normal conditions
retry_check_interval 5 ; Re-check the service every 5 minutes until a hard state can be determined
contact_groups support ; Notifications get sent out to everyone in the 'support' group
notification_options w,u,c,r ; Send notifications about warning, unknown, critical, and recovery events
notification_interval 5 ; Re-notify about service problems every 5 mins
notification_period 24x7 ; Notifications can be sent out at any time
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}