I have some Python automation that spawns telnet sessions, which I log with the Linux script command; each logging session has two script process IDs (a parent and a child).
I need to solve a problem: if the Python automation script dies, the script sessions never close on their own. For some reason this is harder than it should be.
So far I have implemented watchdog.py (see the bottom of the question), which daemonizes itself and polls the PID of the Python automation script in a loop. When it sees the automation PID disappear from the server's process table, it tries to kill the script sessions.
My problem is:
script sessions always spawn two separate processes, and one of the script processes is the parent of the other.
watchdog.py will not kill the child script sessions if I start the script sessions from the automation script (see the automation example below).
Automation example (reproduce_bug.py):

    import pexpect as px
    from subprocess import Popen
    import code
    import time
    import sys
    import os

    def read_pid_and_telnet(_child, addr):
        time.sleep(0.1) # Give the OS time to write the PIDFILE
        # Read the PID in the PIDFILE
        fh = open('PIDFILE', 'r')
        pid = int(''.join(fh.readlines()))
        fh.close()
        time.sleep(0.1)
        # Clean up the PIDFILE
        os.remove('PIDFILE')
        _child.expect(['#', '\$'], timeout=3)
        _child.sendline('telnet %s' % addr)
        return str(pid)

    pidlist = list()
    child1 = px.spawn("""bash -c 'echo $$ > PIDFILE """
        """&& exec /usr/bin/script -f LOGFILE1.txt'""")
    pidlist.append(read_pid_and_telnet(child1, '10.1.1.1'))
    child2 = px.spawn("""bash -c 'echo $$ > PIDFILE """
        """&& exec /usr/bin/script -f LOGFILE2.txt'""")
    pidlist.append(read_pid_and_telnet(child2, '10.1.1.2'))

    cmd = "python watchdog.py -o %s -k %s" % (os.getpid(), ','.join(pidlist))
    Popen(cmd.split(' '))
    print "I started the watchdog with:\n %s" % cmd
    time.sleep(0.5)
    raise RuntimeError, "Simulated script crash. Note that script child sessions are hung"
Here is what happens when I run the automation above. Note that PID 30017 spawns 30018 and PID 30020 spawns 30021. All of those PIDs are script sessions.

    [mpenning@Hotcoffee Network]$ python reproduce_bug.py
    I started the watchdog with:
     python watchdog.py -o 30016 -k 30017,30020
    Traceback (most recent call last):
      File "reproduce_bug.py", line 35, in <module>
        raise RuntimeError, "Simulated script crash. Note that script child sessions are hung"
    RuntimeError: Simulated script crash. Note that script child sessions are hung
    [mpenning@Hotcoffee Network]$
After running the automation above, all of the child script sessions are still running.

    [mpenning@Hotcoffee Models]$ ps auxw | grep script
    mpenning 30018  0.0  0.0  15832   508 ?        S    12:08   0:00 /usr/bin/script -f LOGFILE1.txt
    mpenning 30021  0.0  0.0  15832   516 ?        S    12:08   0:00 /usr/bin/script -f LOGFILE2.txt
    mpenning 30050  0.0  0.0   7548   880 pts/8    S+   12:08   0:00 grep script
    [mpenning@Hotcoffee Models]$
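I am running the automation under Python 2.6.6 on a Debian Squeeze Linux system (uname -a: Linux Hotcoffee 2.6.32-5-amd64 #1 SMP Mon Jan 16 16:22:28 UTC 2012 x86_64 GNU/Linux).
It seems that the daemon does not survive the crash of the process that spawned it. How can I fix watchdog.py so that it closes all of the script sessions if the automation dies (as in the example above)?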
A watchdog.py log that illustrates the problem (unfortunately, the PIDs do not match those in the original question):

    [mpenning@Hotcoffee ~]$ cat watchdog.log
    2012-02-22,15:17:20.356313 Start watchdog.watch_process
    2012-02-22,15:17:20.356541 observe pid = 31339
    2012-02-22,15:17:20.356643 kill pids = 31352,31356
    2012-02-22,15:17:20.356730 seconds = 2
    [mpenning@Hotcoffee ~]$
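The problem was essentially a race condition. By the time I tried to kill the "parent" script processes, they had already died along with the automation.
To solve it: first, the watchdog daemon needs to identify the entire list of children to be killed before it starts polling the observed PID (my original script tried to identify the children after the observed PID had crashed). Next, I had to modify the watchdog daemon to allow for the possibility that some script processes die together with the observed PID.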
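As a rough illustration of those two changes (this is not the watchdog.py code itself), here is a minimal standard-library sketch. The helper names pid_alive, terminate_tolerantly and watch are hypothetical; it assumes a POSIX system and that the complete list of PIDs to kill was collected before polling starts.

```python
import errno
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this PID exists (signal 0 only probes).

    Note: a zombie (exited but not yet reaped) still counts as alive here.
    """
    try:
        os.kill(pid, 0)
    except OSError as exc:
        # EPERM means the process exists but belongs to another user.
        return exc.errno == errno.EPERM
    return True


def terminate_tolerantly(pids, sig=signal.SIGTERM):
    """Signal every PID and return the ones that were already gone."""
    already_dead = []
    for pid in pids:
        try:
            os.kill(pid, sig)
        except OSError as exc:
            if exc.errno != errno.ESRCH:
                raise
            # ESRCH: the process died together with the observed PID.
            already_dead.append(pid)
    return already_dead


def watch(observed_pid, pids_to_kill, seconds=2):
    """Poll observed_pid; once it is gone, signal the pre-collected PIDs."""
    while pid_alive(observed_pid):
        time.sleep(seconds)
    return terminate_tolerantly(pids_to_kill)
```

The point of the design is that a missing PID is treated as an expected outcome rather than an error, so one parent that has already died cannot stop the loop from reaching the children that are still running.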
watchdog.py:

    #!/usr/bin/python
    """
    Implement a cross-platform watchdog daemon, which observes a PID and kills
    other PIDs if the observed PID dies.

    Example:
    --------
    watchdog.py -o 29322 -k 29345,29346,29348 -s 2

    The command checks PID 29322 every 2 seconds and kills PIDs 29345, 29346, 29348
    and their children, if PID 29322 dies.

    Requires:
    ----------
    * https://github.com/giampaolo/psutil
    * http://pypi.python.org/pypi/python-daemon
    """
    from optparse import OptionParser
    import datetime as dt
    import signal
    import daemon
    import logging
    import psutil
    import time
    import sys
    import os

    class MyFormatter(logging.Formatter):
        converter=dt.datetime.fromtimestamp
        def formatTime(self, record, datefmt=None):
            ct = self.converter(record.created)
            if datefmt:
                s = ct.strftime(datefmt)
            else:
                t = ct.strftime("%Y-%m-%d %H:%M:%S")
                s = "%s,%03d" % (t, record.msecs)
            return s

    def check_pid(pid):
        """ Check For the existence of a unix / windows pid."""
        try:
            os.kill(pid, 0) # Kill 0 raises OSError, if pid isn't there...
        except OSError:
            return False
        else:
            return True

    def kill_process(logger, pid):
        try:
            psu_proc = psutil.Process(pid)
        except Exception, e:
            logger.debug('Caught Exception ["%s"] while looking up PID %s' % (e, pid))
            return False

        logger.debug('Sending SIGTERM to %s' % repr(psu_proc))
        psu_proc.send_signal(signal.SIGTERM)
        psu_proc.wait(timeout=None)
        return True

    def watch_process(observe, kill, seconds=2):
        """Kill the process IDs listed in 'kill', when 'observe' dies."""
        logger = logging.getLogger(__name__)
        logger.setLevel(logging.DEBUG)
        logfile = logging.FileHandler('%s/watchdog.log' % os.getcwd())
        logger.addHandler(logfile)
        formatter = MyFormatter(fmt='%(asctime)s %(message)s',datefmt='%Y-%m-%d,%H:%M:%S.%f')
        logfile.setFormatter(formatter)

        logger.debug('Start watchdog.watch_process')
        logger.debug(' observe pid = %s' % observe)
        logger.debug(' kill pids = %s' % kill)
        logger.debug(' seconds = %s' % seconds)
        children = list()

        # Get PIDs of all child processes...
        for childpid in kill.split(','):
            children.append(childpid)
            p = psutil.Process(int(childpid))
            for subpsu in p.get_children():
                children.append(str(subpsu.pid))

        # Poll observed PID...
        while check_pid(int(observe)):
            logger.debug('Poll PID: %s is alive.' % observe)
            time.sleep(seconds)
        logger.debug('Poll PID: %s is *dead*, starting kills of %s' % (observe, ', '.join(children)))

        for pid in children:
            # kill all child processes...
            kill_process(logger, int(pid))
        sys.exit(0) # Exit gracefully

    def run(observe, kill, seconds):
        with daemon.DaemonContext(detach_process=True,
            stdout=sys.stdout,
            working_directory=os.getcwd()):
            watch_process(observe=observe, kill=kill, seconds=seconds)

    if __name__=='__main__':
        parser = OptionParser()
        parser.add_option("-o", "--observe", dest="observe", type="int",
                          help="PID to be observed", metavar="INT")
        parser.add_option("-k", "--kill", dest="kill",
                          help="Comma separated list of PIDs to be killed",
                          metavar="TEXT")
        parser.add_option("-s", "--seconds", dest="seconds", default=2, type="int",
                          help="Seconds to wait between observations (default = 2)",
                          metavar="INT")
        (options, args) = parser.parse_args()
        run(options.observe, options.kill, options.seconds)

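Answer 0 (score: 2)
Your problem is that script is not detached from the automation script after it is spawned, so it runs as a child, and when the parent dies it is left unmanaged.
To handle the exit of the Python script, you can use the atexit module. To monitor child process exit, you can use os.wait or handle the SIGCHLD signal.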
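Answer 1 (score: 1)
You could try to kill the complete process group, which contains the parent script, the child script, the bash spawned by script and, maybe, even the telnet process.
The kill(2) manual says:
    If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.
So the equivalent of kill -TERM -$PID will do the job.
Oh, and the PID you need is that of the parent script.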
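To make the process-group idea concrete, here is a minimal sketch using only the standard library. It assumes a POSIX system; spawn_in_new_group and kill_group_of are hypothetical helper names, not functions from watchdog.py, and the guard against signalling the caller's own group is an added safety check rather than something the answer describes.

```python
import errno
import os
import signal
import subprocess


def spawn_in_new_group(argv):
    """Start argv as the leader of its own session and process group."""
    # os.setsid runs in the child before exec, so its PID becomes the group ID.
    return subprocess.Popen(argv, preexec_fn=os.setsid)


def kill_group_of(pid, sig=signal.SIGTERM):
    """Send sig to every process in pid's group; return False if pid is gone."""
    try:
        pgid = os.getpgid(pid)
        if pgid == os.getpgrp():
            # Refuse to signal our own group, which would kill the caller too.
            raise ValueError("PID %d shares the caller's process group" % pid)
        os.killpg(pgid, sig)  # same effect as os.kill(-pgid, sig)
    except OSError as exc:
        if exc.errno != errno.ESRCH:
            raise
        return False
    return True
```

This only reaches processes that stayed in the leader's group; a descendant that moved itself into a different group or session would not receive the signal.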
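Edit:
If I adjust the following two functions in watchdog.py, killing the process group seems to work for me:
[the two modified functions are missing from this copy]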
Answer 2 (score: 0)
Maybe you could use os.system() and run killall from the watchdog to kill all instances of /usr/bin/script.
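Answer 3 (score: 0)
On inspection, it seems that psu_proc.kill() (actually send_signal()) should raise OSError on failure, but just in case: have you tried checking for termination before setting the flag? For example:

    if not psu_proc.is_running():
        finished = True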