Question

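I created a script that by default creates a single multiprocessing Process; with that it works fine. When starting multiple processes, it begins to hang, and not always in the same place. The program is about 700 lines of code, so I'll summarize what's going on. I want to make the most of my multi-core machine by parallelizing the slowest task, which is aligning DNA sequences. For that I use the subprocess module to call a command-line program, 'hmmsearch', which I can feed sequences through /dev/stdin, and then I read the aligned sequences out through /dev/stdout. I imagine the hang occurs because of these multiple subprocess instances reading/writing from stdout/stdin, and I really don't know the best way to go about this...

I was looking into os.fdopen(...) and os.tmpfile() to create temporary file handles or pipes where I can flush the data through. However, I've never used either before and I can't picture how to do that with the subprocess module. Ideally I'd like to bypass the hard drive entirely, because pipes are much better for high-throughput data processing! Any help with this would be super wonderful!!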

import multiprocessing, os, subprocess
from Bio import SeqIO

class align_seq( multiprocessing.Process ):
    def __init__( self, inPipe, outPipe, semaphore, options ):
        multiprocessing.Process.__init__(self)
        self.in_pipe = inPipe          ## Sequences in
        self.out_pipe = outPipe        ## Alignment out
        self.options = options.copy()  ## Modifiable sub-environment
        self.sem = semaphore

    def run(self):
        inp = self.in_pipe.recv()
        while inp != 'STOP':
            seq_record , HMM = inp  # seq_record is only ever one Bio.Seq.SeqRecord object at a time.
                                    # HMM is a file location.
            align_process = subprocess.Popen( ['hmmsearch', '-A', '/dev/stdout', '-o',os.devnull, HMM, '/dev/stdin'], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE )
            self.sem.acquire()
            align_process.stdin.write( seq_record.format('fasta') )
            align_process.stdin.close()
            for seq in SeqIO.parse( align_process.stdout, 'stockholm' ):  # get the alignment output
                self.out_pipe.send_bytes( seq.seq.tostring() ) # send it to consumer
            align_process.wait()   # Don't know if there's any need for this??
            self.sem.release()
            align_process.stdout.close()
            inp = self.in_pipe.recv()  
        self.in_pipe.close()    #Close handles so don't overshoot max. limit on number of file-handles.
        self.out_pipe.close()

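After spending a while debugging this, I've found a problem that was always there and isn't quite solved yet, but I fixed some other inefficiencies in the process of debugging. There are two initial feeder functions: this align_seq class, and a file parser parseHMM() which loads a position-specific scoring matrix (PSM) into a dictionary. The main parent process then compares the alignment to the PSM, using a dictionary (of dictionaries) as a pointer to the relevant score for each residue. To calculate the scores I want, I have two separate multiprocessing.Process classes: one class logScore(), which calculates the log odds ratio (with math.exp()); I parallelize this one; and it queues the calculated scores to the last process, sumScore(), which just sums these scores (with math.fsum), returning the sum and all position-specific scores back to the parent process as a dictionary, i.e. Queue.put([sum, {residue position: position-specific score, ...}]). I find this exceptionally confusing to get my head around (too many queues!), so I hope readers are managing to follow...

After all the above calculations are done, I give the option to save the cumulative scores as tab-delimited output. This is where it now (since last night) sometimes breaks, as I ensure it prints out a score for every position where there should be a score. I think that due to latency (the processes' timings being out of sync), sometimes what gets put in the queue first for logScore doesn't reach sumScore first. In order for sumScore to know when to return the tally and start again, I put 'endSEQ' into the queue of the last logScore process that performed a calculation. I thought it would then reach sumScore last too, but that's not always the case; only sometimes does it break. So now I don't get a deadlock anymore, but instead a KeyError when printing or saving the results.

I believe the reason for sometimes getting a KeyError is that I create a Queue for each logScore process, when they should all use the same Queue. Right now I have something like: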

class logScore( multiprocessing.Process ):
    def __init__( self, inQ, outQ ):
        self.inQ = inQ
        ...

def scoreSequence( processes, HMMPSM, sequenceInPipe ):
    process_index = -1
    sequence = sequenceInPipe.recv_bytes()
    for residue in sequence:
        .... ## Get the residue score.
        process_index += 1
        processes[process_index].inQ.put( residue_score )
    ## End of sequence
    processes[process_index].inQ.put( 'endSEQ' )


logScore_to_sumScoreQ = multiprocessing.Queue()
logScoreProcesses = [ logScore( multiprocessing.Queue() , logScore_to_sumScoreQ ) for i in xrange( options['-ncpus'] ) ]
sumScoreProcess = sumScore( logScore_to_sumScoreQ, scoresOut )

Whereas I should create just one Queue to share between all the logScore instances, i.e.

logScore_to_sumScoreQ = multiprocessing.Queue()
scoreSeq_to_logScore = multiprocessing.Queue()
logScoreProcesses = [ logScore( scoreSeq_to_logScore , logScore_to_sumScoreQ ) for i in xrange( options['-ncpus'] ) ]
sumScoreProcess = sumScore( logScore_to_sumScoreQ, scoresOut )
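
For illustration, here is a minimal sketch of that single-shared-queue layout. It is not the original program: the names `log_score`, `score_sequence` and `STOP` are hypothetical, the per-residue calculation is a placeholder (`math.log` of a made-up odds value), and it assumes Python 3. It also swaps the single 'endSEQ' marker for one end marker per worker, so the consumer only returns the tally once every worker has reported that it is finished, regardless of the order in which results arrive.

```python
import math
import multiprocessing

STOP = None  # hypothetical sentinel; the question uses the string 'endSEQ'


def log_score(in_q, out_q):
    """Worker: pull (position, odds) pairs from the shared queue until STOP."""
    for position, odds in iter(in_q.get, STOP):
        out_q.put((position, math.log(odds)))  # placeholder calculation
    out_q.put(STOP)  # tell the consumer that this worker is finished


def score_sequence(odds_by_position, n_workers=2):
    # 'fork' is chosen only so the sketch runs without a __main__ guard;
    # it is not available on Windows.
    ctx = multiprocessing.get_context("fork")
    in_q, out_q = ctx.Queue(), ctx.Queue()
    workers = [ctx.Process(target=log_score, args=(in_q, out_q))
               for _ in range(n_workers)]
    for w in workers:
        w.start()

    for item in odds_by_position.items():
        in_q.put(item)
    for _ in workers:
        in_q.put(STOP)  # one sentinel per worker: each get() consumes one

    # Collect results until every worker has sent its end marker.
    scores, finished = {}, 0
    while finished < n_workers:
        msg = out_q.get()
        if msg is STOP:
            finished += 1
        else:
            position, score = msg
            scores[position] = score

    for w in workers:
        w.join()  # join only after the output queue has been drained
    return math.fsum(scores.values()), scores
```

Because each worker puts its own marker after its last result, a marker can never overtake that worker's scores, which is the situation that seemed to produce the missing keys.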

Answer 1

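That's not quite how pipelining works... but to put your mind at ease, here's an excerpt from the subprocess documentation:

stdin, stdout and stderr specify the executed program's standard input, standard output and standard error file handles, respectively. Valid values are PIPE, an existing file descriptor (a positive integer), an existing file object, and None. PIPE indicates that a new pipe to the child should be created. With None, no redirection will occur; the child's file handles will be inherited from the parent.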

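The most error-prone areas are the communication with the main process and the semaphore management. Perhaps a state transition or synchronization step isn't happening as expected because of a bug? I suggest adding log/print statements before and after each blocking call - wherever you communicate with the main process and wherever you acquire/release the semaphore - to narrow down where things go wrong.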

I'm also curious - is the semaphore absolutely necessary?
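
One way to do the before/after logging suggested above is a small context manager wrapped around each blocking call. This is only a sketch: `traced` and the logger name `align_debug` are made-up names, and the sleep at the end stands in for a real blocking call such as `self.sem.acquire()` or `self.in_pipe.recv()`.

```python
import logging
import time
from contextlib import contextmanager

# Include the process id so output from several workers can be told apart.
logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s pid=%(process)d %(message)s")
log = logging.getLogger("align_debug")  # hypothetical logger name


@contextmanager
def traced(label):
    """Log just before and just after a call that might block."""
    log.debug("BEFORE %s", label)
    start = time.time()
    try:
        yield
    finally:
        log.debug("AFTER  %s (%.3fs)", label, time.time() - start)


# In align_seq.run() this would wrap each blocking call, for example:
#     with traced("sem.acquire"):
#         self.sem.acquire()
#     with traced("in_pipe.recv"):
#         inp = self.in_pipe.recv()
with traced("demo sleep"):
    time.sleep(0.01)  # stand-in for a blocking call
```

A "BEFORE" line with no matching "AFTER" line for the same pid points at the call where that process is stuck.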

Answer 2

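I also wanted to parallelize simple tasks, and for that I created a small Python script. You can take a look at: http://bioinf.comav.upv.es/psubprocess/index.html

It is a little more general than what you want, but for simple tasks it is quite easy to use. It could at least be of some inspiration to you.

Jose Blanca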

Answer 3

It could be a deadlock in subprocess; have you tried using the communicate method rather than wait? http://docs.python.org/library/subprocess.html
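
A rough sketch of what that could look like, assuming Python 3. `run_filter` is a hypothetical helper, and a tiny child Python process that upper-cases its input stands in for hmmsearch so the example runs anywhere. `communicate()` writes the input, closes stdin, reads stdout until EOF and waits for the child, so neither side can end up blocked on a full pipe buffer.

```python
import subprocess
import sys


def run_filter(cmd, text):
    """Feed text to cmd on stdin and return everything it writes to stdout."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    # communicate() handles writing, closing stdin, reading and waiting.
    out, _ = proc.communicate(text.encode())
    if proc.returncode != 0:
        raise RuntimeError("command failed with exit code %d" % proc.returncode)
    return out.decode()


# Stand-in for hmmsearch: a child process that upper-cases whatever it reads.
ECHO_UPPER = [sys.executable, "-c",
              "import sys; sys.stdout.write(sys.stdin.read().upper())"]

if __name__ == "__main__":
    print(run_filter(ECHO_UPPER, ">seq1\nacgt\n"))
```

Note that communicate() holds the whole output in memory, which should be fine if only one sequence is sent per call as in the question. The returned text could then be wrapped in io.StringIO and passed to SeqIO.parse in place of align_process.stdout.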

Python multiprocessing, each with its own subprocess (Kubuntu, Mac)
