Question

I am using monit to monitor my program. The monitored program can crash in two situations:

The program can crash randomly. It just needs to be restarted.
The program gets into a bad state and crashes every time it is started.

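To fix the latter situation, I have a script that stops the program, resets it to a good state by cleaning its data files, and restarts it. I tried the configuration below: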

check process program with pidfile program.pid
start program = "programStart" as uid username and gid groupname
stop program = "programStop" as uid username and gid groupname
if 3 restarts within 20 cycles then exec "cleanProgramAndRestart" as uid username and gid groupname
if 6 restarts within 20 cycles then timeout

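Suppose monit restarts the program 3 times within 3 cycles. After the third restart, the cleanProgramAndRestart script runs. But when cleanProgramAndRestart restarts the program yet again, the 3-restarts condition is satisfied again in the next cycle, and it becomes an infinite loop.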

Could anyone suggest a way to fix this?

If any of the following were possible, there might be a way around it:

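If there were a "crash" keyword instead of "restarts", I would be able to run the clean script after the program has crashed 3 times, rather than after it has been restarted 3 times.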

If there were a way to reset the "restarts" counter after running the exec script.

If there were a way to execute something only when the outcome of the "3 restarts" condition changes.

Answer 1

Monit polls your "tests" every cycle. The cycle length is usually defined in /etc/monitrc, in set daemon cycle_length.

So if your cleanProgramAndRestart takes less than one cycle to run, this should not happen. Since it does happen, I guess your cleanProgramAndRestart takes more than one cycle to complete.

You could:

Increase the cycle length in the Monit configuration
Check your program every x cycles (make sure that cycle_length * x > cleanProgramAndRestart_length)
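
As a rough sketch of the two options above, assuming a 60-second cycle and a check interval of 5 cycles (both values are placeholders picked for illustration, not taken from the question), the configuration might look something like this. The program, script, user, and group names are copied from the question's config as-is.

```
# /etc/monitrc (sketch only; 60 and 5 are assumed placeholder values)
set daemon 60    # cycle_length: Monit runs its checks every 60 seconds

check process program with pidfile program.pid
  every 5 cycles # evaluate this check only on every 5th cycle (about 300 s here)
  start program = "programStart" as uid username and gid groupname
  stop program = "programStop" as uid username and gid groupname
  if 3 restarts within 20 cycles then exec "cleanProgramAndRestart" as uid username and gid groupname
  if 6 restarts within 20 cycles then timeout
```

With these numbers, cleanProgramAndRestart would need to finish in under roughly 300 seconds. Pick x so that cycle_length * x stays above the script's real run time.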

If you cannot modify these variables, there is a possible workaround using a temporary file:

check process program 
  with pidfile program.pid
  start program = "programStart" 
    as uid username and gid groupname
  stop program = "programStop" 
    as uid username and gid groupname
  if 3 restarts within 20 cycles 
  then exec "touch /tmp/program_crash" 
  if 6 restarts within 20 cycles then timeout

check file program_crash with path /tmp/program_crash every x cycles #(make sure that cycle_length*x > cleanProgramAndRestart_length)
  if changed timestamp then exec "cleanProgramAndRestart"
    as uid username and gid groupname

Monit - How to identify a program crash rather than a restart
