Question

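I have a complex Python pipeline (whose code I can't change) that calls multiple other scripts and other executables. The point is that it takes a very long time to run over 8000 directories, doing some scientific analysis. So I wrote a simple wrapper using the multiprocessing module (it might not be the most efficient, but it seems to work).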

from os import path, listdir, mkdir, system
from os.path import join as osjoin, exists, isfile
from GffTools import Gene, Element, Transcript
from GffTools import read as gread, write as gwrite, sort as gsort
from re import match
from multiprocessing import JoinableQueue, Process
from sys import argv, exit

# some absolute paths
inbase = "/.../abfgp_in"
outbase = "/.../abfgp_out"
abfgp_cmd = "python /.../abfgp-2.rev/abfgp.py"
refGff = "/.../B0510_manual_reindexed_noSeq.gff"

# the Queue
Q = JoinableQueue()
i = 0

# define number of processes
try: num_p = int(argv[1])
except (ValueError, IndexError): exit("Wrong CPU argument")

# This is the function calling the abfgp.py script, which in turn calls a lot of third-party software
def abfgp(id_, pid):
    out = osjoin(outbase, id_)
    if not exists(out): mkdir(out)

    # logfile
    log = osjoin(outbase, "log_process_%s" %(pid))
    try:
        # call the script
        system("%s --dna %s --multifasta %s --target %s -o %s -q >>%s" %(abfgp_cmd, osjoin(inbase, id_, id_ +".dna.fa"), osjoin(inbase, id_, "informants.mfa"), id_, out, log))
    except:
        print "ABFGP FAILED"
        return

# parse the output
def extractGff(id_):
    pass  # code not relevant


# function called by multiple processes, using the Queue
def run(Q, pid):
    while not Q.empty():
        try:
            d = Q.get()             
            print "%s\t=>>\t%s" %(str(i-Q.qsize()), d)          
            abfgp(d, pid)
            Q.task_done()
        except KeyboardInterrupt:
            exit("Interrupted Child")

# list of directories
genedirs = [d for d in listdir(inbase)]
genes = gread(refGff)
for d in genedirs:
    i += 1
    indir = osjoin(inbase, d)
    outdir = osjoin(outbase, d)
    Q.put(d)

# this loop creates the multiple processes
procs = []
for pid in range(num_p):
    try:
        p = Process(target=run, args=(Q, pid+1))
        p.daemon = True
        procs.append(p) 
        p.start()
    except KeyboardInterrupt:
        print "Aborting start of child processes"
        for x in procs:
            x.terminate()
        exit("Interrupted")     

try:
    for p in procs:
        p.join()
except:
    print "Terminating child processes"
    for x in procs:
        x.terminate()
    exit("Interrupted")

print "Parsing output..."
for d in genedirs: extractGff(d)

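Now the problem is that abfgp.py uses the os.chdir function, which seems to disrupt the parallel processing. I get a lot of errors stating that some (input/output) files or directories cannot be found for reading or writing, even though I call the script through os.system(), which I thought would spawn separate processes and so prevent this.

How can I work around these chdir interferences?

Edit: I might change os.system() to subprocess.Popen(cwd="...") with the right directory. I hope this makes a difference.

Thanks.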

Answer 1

Edit 2

Do not use os.system(); use subprocess.call() instead.

system("%s --dna %s --multifasta %s --target %s -o %s -q >>%s" %(abfgp_cmd, osjoin(inbase, id_, id_ +".dna.fa"), osjoin(inbase, id_, "informants.mfa"), id_, out, log))

would translate to

# without log; abfgp_cmd is split because it holds both "python" and the script path
subprocess.call(abfgp_cmd.split() + ['--dna', osjoin(inbase, id_, id_ +".dna.fa"), '--multifasta', osjoin(inbase, id_, "informants.mfa"), '--target', id_, '-o', out, '-q'])
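A minimal sketch of how the dropped log redirection and a per-job working directory might be added back, assuming Python 3. The helper name run_in_dir and the demo command are made up for illustration and are not part of the original pipeline; cwd, stdout and stderr are standard subprocess.call parameters. Whether a fixed cwd actually helps depends on what abfgp.py does internally, which the question does not show.

```python
import os
import subprocess
import sys
import tempfile


def run_in_dir(cmd, workdir, log_path):
    """Run cmd (a list of arguments) with workdir as its working directory.

    Output is appended to log_path, replacing the shell's '>>' redirection.
    Returns the exit code of the command.
    """
    with open(log_path, "a") as log:
        return subprocess.call(cmd, cwd=workdir, stdout=log,
                               stderr=subprocess.STDOUT)


if __name__ == "__main__":
    # Demo with a harmless stand-in command (hypothetical, not abfgp.py).
    workdir = tempfile.mkdtemp()
    log_path = os.path.join(workdir, "log_process_1")
    demo_cmd = [sys.executable, "-c", "import os; print(os.getcwd())"]
    exit_code = run_in_dir(demo_cmd, workdir, log_path)
    with open(log_path) as log:
        print(exit_code, log.read().strip())
```

In the wrapper from the question, cmd would be the argument list built above and workdir a directory chosen per job, for example the job's output directory.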

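Edit 1

I think the problem is that multiprocessing uses the module names to serialize functions and classes.

This means that if you do import module, where the module is in ./module.py, and then do something like os.chdir('./dir'), you would now need from .. import module.

The child processes inherit the working directory of the parent process. This may be a problem.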
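To illustrate the inheritance point: a child process starts in the parent's working directory, but a chdir inside the child does not propagate back to the parent or to sibling processes. A minimal Python 3 sketch (the helper name child_cwds is made up for this illustration):

```python
import os
import subprocess
import sys
import tempfile

# Code run inside the child: report cwd, chdir to the given target, report again.
CHILD_CODE = (
    "import os, sys\n"
    "print(os.getcwd())\n"
    "os.chdir(sys.argv[1])\n"
    "print(os.getcwd())\n"
)


def child_cwds(target):
    """Start a child Python process; return its cwd before and after chdir."""
    out = subprocess.check_output([sys.executable, "-c", CHILD_CODE, target])
    before, after = out.decode().splitlines()
    return before, after


if __name__ == "__main__":
    parent_before = os.getcwd()
    before, after = child_cwds(tempfile.gettempdir())
    print("child started in:", before)  # inherited from the parent
    print("child moved to:  ", after)
    print("parent unchanged:", os.getcwd() == parent_before)
```

So a chdir inside one abfgp.py run cannot by itself change the working directory of another run; each process keeps its own.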

Solutions

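Make sure all modules are imported (in the child processes) before you change directory.
Insert the original os.getcwd() into sys.path to enable imports from the original directory. This must be done before any function is called from the local directory.
Put all functions that you use inside a directory that can always be imported. site-packages could be such a directory. Then you can do something like import module followed by module.main() to start what you do.

This is a hack that I wrote because I know how pickle works. Only use it if the other attempts fail. The script prints:

serialized # the function runD is serialized
string executed # before the function is loaded the code is executed
loaded # now the function run is deserialized
run # run is called

In your case, you would do something like this:

runD = evalBeforeDeserialize('__import__("sys").path.append({})'.format(repr(os.getcwd())), run)
p = Process(target=runD, args=(Q, pid+1))

This is the script:

# functions that you need

class R(object):
    """Pickles as the result of call(*args), evaluated at load time."""
    def __init__(self, call, *args):
        self.ret = (call, args)
    def __reduce__(self):
        return self.ret
    def __call__(self, *args, **kw):
        # only present because pickle requires the reduce target to be callable
        raise NotImplementedError('this should never be called')

class evalBeforeDeserialize(object):
    """Pickles as function, but evaluates string first when it is unpickled."""
    def __init__(self, string, function):
        self.function = function
        self.string = string
    def __reduce__(self):
        # unpickles as tuple.__getitem__((eval(string), function), -1), i.e. function
        return R(getattr, tuple, '__getitem__'), \
                 ((R(eval, self.string), self.function), -1)

# code to show how it works

def printing():
    print('string executed')

def run():
    print('run')

runD = evalBeforeDeserialize('__import__("__main__").printing()', run)

import pickle

s = pickle.dumps(runD)
print('serialized')
run2 = pickle.loads(s)
print('loaded')
run2()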

Please report back if these do not work.
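The second suggestion in this answer (putting the original directory on sys.path before any chdir) could look roughly like the sketch below, assuming Python 3. The function name chdir_keeping_imports is hypothetical, and it would have to run in the child process before the directory changes.

```python
import os
import sys


def chdir_keeping_imports(new_dir):
    """Change directory, first pinning the current one on sys.path.

    Returns the absolute path of the directory that was pinned.
    """
    original = os.getcwd()
    if original not in sys.path:
        # Absolute entry: stays valid wherever the process moves later.
        sys.path.insert(0, original)
    os.chdir(new_dir)
    return original
```

Because the entry added to sys.path is an absolute path, modules next to the wrapper script remain importable after the directory change.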

Answer 2

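You could determine which instance of the os library the unalterable program is using, then create a tailored version of chdir in that library that does what you need: prevent the directory change, log it, whatever. If the tailored behavior needs to apply to just the one program, you can use the inspect module to identify the caller and tailor the behavior specifically for that caller.

Your options are limited if you truly can't alter the existing program; but if you have the option of altering the libraries it imports, something like this could be the least invasive way to head off the undesired behavior.

The usual caveats apply when altering a standard library.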
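A rough sketch of the idea in this answer, assuming Python 3: os.chdir is replaced with a wrapper that logs every call and ignores calls coming from selected files, using the inspect module to find the caller. All names here (install_guarded_chdir, uninstall_guarded_chdir, chdir_log, blocked_substring) are invented for the illustration. Note that such a patch only affects the Python process in which it is installed; since the wrapper starts abfgp.py as a separate process, the patch would have to be loaded inside that process, which is what altering a library it imports amounts to.

```python
import inspect
import os

_original_chdir = os.chdir
chdir_log = []  # (caller_file, requested_path) tuples, for inspection


def install_guarded_chdir(blocked_substring):
    """Replace os.chdir with a wrapper that logs every call and ignores
    calls made from files whose path contains blocked_substring."""

    def guarded_chdir(path):
        # f_back is the frame of whoever called os.chdir
        caller_file = inspect.currentframe().f_back.f_code.co_filename
        chdir_log.append((caller_file, path))
        if blocked_substring in caller_file:
            return None  # skip the directory change for this caller
        return _original_chdir(path)

    os.chdir = guarded_chdir


def uninstall_guarded_chdir():
    """Restore the real os.chdir."""
    os.chdir = _original_chdir
```

Silently ignoring a chdir can break code that relies on relative paths afterwards, so logging first and blocking only once the effect is understood may be the safer order.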

os.chdir between multiple Python processes
