Question

I have a Python script that restarts itself using the following:

python = sys.executable
os.execl(python, python, * sys.argv)

Most of the time this works fine, but occasionally the restart fails with a "No module named ..." error. Examples:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site.py", line 68, in <module>
    import os
  File "/usr/lib/python2.7/os.py", line 49, in <module>
    import posixpath as path
  File "/usr/lib/python2.7/posixpath.py", line 17, in <module>
    import warnings
  File "/usr/lib/python2.7/warnings.py", line 6, in <module>
    import linecache
ImportError: No module named linecache

Traceback (most recent call last):
  File "/usr/lib/python2.7/site.py", line 68, in <module>
    import os
  File "/usr/lib/python2.7/os.py", line 49, in <module>
    import posixpath as path
  File "/usr/lib/python2.7/posixpath.py", line 15, in <module>
    import stat
ImportError: No module named stat

Edit: I tried gc.collect() as suggested by andr0x, but it did not help. I got the same error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site.py", line 68, in <module>
    import os
  File "/usr/lib/python2.7/os.py", line 49, in <module>
    import posixpath as path
ImportError: No module named posixpath

Edit 2: I tried sys.stdout.flush() and I still get the same error. I have noticed that I only get between 1 and 3 successful restarts before the error occurs.

Answer 1

I believe you are hitting the following bug:

http://bugs.python.org/issue16981

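Since these modules are very unlikely to be disappearing, some other error must be the real cause. The bug report lists "too many open files" as prone to causing this issue, but I am not sure whether other errors can trigger it as well.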

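I would make sure you close any open file handles before reaching the restart code. You can also force the garbage collector to run manually with the following: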

import gc
gc.collect()

http://docs.python.org/2/library/gc.html

You can also try calling it just before reaching the restart code.
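To check whether descriptor exhaustion is a plausible cause before restarting, a small diagnostic can help. The following is a minimal sketch, assuming Linux (it reads /proc/self/fd) and written in Python 3 syntax; the helper names open_fd_count and fd_usage are hypothetical and not part of any library.

```python
import os
import resource


def open_fd_count():
    """Count this process's open descriptors (Linux only, reads /proc)."""
    # The count includes the descriptor os.listdir itself opens briefly.
    return len(os.listdir("/proc/self/fd"))


def fd_usage():
    """Return (open descriptor count, soft RLIMIT_NOFILE limit)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fd_count(), soft


if __name__ == "__main__":
    used, soft = fd_usage()
    print("open descriptors: %d of %d allowed" % (used, soft))
```

If the count printed just before each restart keeps growing from one restart to the next, leaked descriptors are a likely suspect.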

Answer 2

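If the problem is that too many files are open, then you have to set the FD_CLOEXEC flag on the file descriptors so that they are closed when the exec happens. Here is a piece of code that simulates hitting the file descriptor limit on reload and includes a fix that avoids hitting the limit. If you want to simulate the crash, set fixit to False. When fixit is True, the code goes through the list of file descriptors and sets FD_CLOEXEC on them. This works on Linux. People working on systems that do not have /proc/<pid>/fd/ will have to find a system-appropriate way to list the open file descriptors. This question may help.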

import os
import sys
import fcntl

pid = str(os.getpid())

def fds():
    return os.listdir(os.path.join("/proc", pid, "fd"))

files = []

print "Number of files open at start:", len(fds())

for i in xrange(0, 102):
    files.append(open("/dev/null", 'r'))

print "Number of files open after going crazy with open()", len(fds())

fixit = True
if fixit:
    # Cycle through all file descriptors opened by our process.
    for f in fds():
        fd = int(f)
        # Transmit the stds to future generations, mark the rest as close-on-exec.
        if fd > 2:
            try:
                fcntl.fcntl(fd, fcntl.F_SETFD, fcntl.FD_CLOEXEC)
            except IOError:
                # Some files can be closed between the time we list
                # the file descriptors and now. Most notably,
                # os.listdir opens the dir and it will probably be
                # closed by the time we hit that fd.
                pass

print "reloading"
python = sys.executable
os.execl(python, python, *sys.argv)

With this code, what I get on stdout is these 3 lines, repeated until I kill the process:

Number of files open at start: 4
Number of files open after going crazy with open() 106
reloading

How the code works

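The code above gets the list of open file descriptors through the fds() function. On a Linux system, the file descriptors opened by a specific process are listed under: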

/proc/<process id of the process we want>/fd

So if your process has process ID 100 and you run:

$ find /proc/100/fd

you will get a list like this:

/proc/100/fd/0
/proc/100/fd/1
/proc/100/fd/2
[...]

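The fds() function just returns the base names of all these files: ["0", "1", "2", ...]. (A more general solution could convert them to integers right away. I chose not to.)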

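The second key part is setting FD_CLOEXEC on all file descriptors other than std{in,out,err}. Setting FD_CLOEXEC on a file descriptor tells the OS that the next time exec is called, it should close the file descriptor before handing control to the next executable. This flag is defined in the man page for fcntl.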
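A minimal sketch of that second step follows, written in Python 3 syntax rather than the Python 2 used above, and assuming Linux with /proc available. The helper names open_fds and mark_close_on_exec are hypothetical. Unlike the code above, it converts the names to integers up front and ORs FD_CLOEXEC into the existing descriptor flags instead of overwriting them.

```python
import fcntl
import os


def open_fds():
    """Return this process's open descriptors as sorted integers (Linux only)."""
    return sorted(int(name) for name in os.listdir("/proc/self/fd"))


def mark_close_on_exec(keep=(0, 1, 2)):
    """Set FD_CLOEXEC on every open descriptor that is not listed in keep."""
    marked = []
    for fd in open_fds():
        if fd in keep:
            continue
        try:
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
            fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        except OSError:
            # The descriptor was closed after it was listed, for example
            # the one os.listdir used to read the directory.
            continue
        marked.append(fd)
    return marked
```

Calling mark_close_on_exec() right before os.execl would leave only stdin, stdout, and stderr open in the restarted process.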

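In an application that uses threads which open files, the code above could miss setting FD_CLOEXEC on some file descriptors: a thread may run between the time the list of file descriptors is obtained and the time exec is called, and open new files in that window. I believe the only way to ensure this does not happen is to replace os.open with code that calls the stock os.open and then immediately sets FD_CLOEXEC on the returned file descriptor.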
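As a rough sketch of that idea (Python 3 syntax; the names set_cloexec and open_cloexec are hypothetical), the wrapper below calls the stock os.open and sets the flag before returning the descriptor. A brief window remains between the two calls; where os.O_CLOEXEC is available, passing it to os.open directly avoids that window. Note also that since Python 3.4 (PEP 446), descriptors created by Python are non-inheritable by default, so this mainly matters on Python 2.

```python
import fcntl
import os

_stock_open = os.open  # keep a reference to the original function


def set_cloexec(fd):
    """OR FD_CLOEXEC into the descriptor flags of fd."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def open_cloexec(path, flags, mode=0o777):
    """Open path with the stock os.open, then mark it close-on-exec."""
    fd = _stock_open(path, flags, mode)
    set_cloexec(fd)
    return fd
```

To apply it process-wide, one could assign os.open = open_cloexec early at startup. That only affects code that goes through os.open; files opened with the built-in open() do not pass through it.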

Answer 3

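Not a real answer, just a workaround for your actual problem: have you considered starting a child process and, if it terminates right away, trying to start another one? This has some implications, such as a changing PID, but maybe you can live with that.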

Instead of

python = sys.executable
os.execl(python, python, * sys.argv)

you could use

import os, sys, time

MONITOR_DURATION = 3.0
# ^^^ time in seconds we monitor our child for terminating too soon

python = sys.executable
while True:  # until we have a child which survived the monitor duration
  pid = os.fork()  # split this process into two
  if pid == 0:  # are we the child process?
    os.execl(python, python, *sys.argv)  # start this program anew
  else:  # we are the parent process
    startTime = time.time()
    while startTime + MONITOR_DURATION > time.time():
      exitedPid, status = os.waitpid(pid, os.WNOHANG)
      # ^^^ check whether our child has terminated yet
      #     (without really waiting for it, due to WNOHANG)
      if exitedPid == pid:  # did our child terminate too soon?
        break  # yes, go around the outer loop and fork again
      else:  # no, nothing terminated yet
        time.sleep(0.2)  # wait a little before testing child again
    else:  # we survived the monitor duration without reaching a "break"
      break  # so we have a good running child, leave the outer loop
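The same supervision idea can also be sketched with the subprocess module. This is a swapped-in variant in Python 3 syntax, not the fork-based code above: the function name start_until_stable and the max_attempts cap are assumptions added for illustration (the loop above retries forever). In Python 3, subprocess.Popen closes descriptors above 2 in the child by default, which may also sidestep the descriptor leak.

```python
import subprocess
import sys


def start_until_stable(cmd, monitor_duration=3.0, max_attempts=5):
    """Start cmd, retrying while it exits within monitor_duration seconds.

    Returns the running Popen object, or None if every attempt died early.
    """
    for _ in range(max_attempts):
        child = subprocess.Popen(cmd)
        try:
            # wait() returns if the child exits within the monitor window.
            child.wait(timeout=monitor_duration)
        except subprocess.TimeoutExpired:
            return child  # still alive after the window, so keep it
    return None


if __name__ == "__main__":
    # Hypothetical usage: start a short-lived child and supervise it.
    demo = [sys.executable, "-c", "import time; time.sleep(1)"]
    child = start_until_stable(demo, monitor_duration=0.2)
    print("child running:", child is not None)
    if child is not None:
        child.wait()
```

To restart the current script this way, cmd would be [sys.executable] + sys.argv, and the parent would exit once a stable child is returned.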

No module named 'x' when reloading with os.execl()
