Python multiprocessing: processes do not seem to start

Date: 2014-08-10 05:17:31

Tags: python linux parallel-processing multiprocessing

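I have a function that reads a binary file and converts each byte into a corresponding sequence of characters. For example, 0x05 becomes 'AACC', 0x2A becomes 'AGGG', and so on. The function that reads the file and converts the bytes is currently linear, and since the files to convert range from 25 KB to 2 MB, this can take quite a while.

So I am trying to use multiprocessing to split up the task and hopefully improve the speed. However, I can't get it to work. Here is the linear function, which is slow but works correctly: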

def fileToRNAString(_file):

    if (_file and os.path.isfile(_file)):
        rnaSequences = []
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                decSequenceToRNA(blockCount, buf, rnaSequences)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences

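Note: the function decSequenceToRNA reads the buffer and converts each byte into the required string. When it runs, it produces a tuple containing the block number and the string, e.g. (1, 'ACCGTAGATTA...'), so in the end I have a list of these tuples available.

I tried to convert the function to use Python's multiprocessing: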

def fileToRNAString(_file):
    rnaSequences = []
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        workers = []
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences))
                p.start()
                workers.append(p)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
        for p in workers:
            p.join()
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences

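However, no processes seem to start: when I run this function, an empty list is returned. None of the messages that decSequenceToRNA prints to the console show up either: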

>>>fileToRNAString(testfile)
[!] Converting /root/src/amino56/M1H2.bin into RNA string (2048 bytes/block).

Unlike this question, I am running Linux shiva 3.14-kali1-amd64 #1 SMP Debian 3.14.5-1kali1 (2014-06-07) x86_64 GNU/Linux and testing the function in PyCrust on Python 2.7.3. I am using the following packages:

import os
import re
import sys
import urllib2
import requests
import logging
import hashlib
import argparse
import tempfile
import shutil
import feedparser
from multiprocessing import Process

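I would like help figuring out why my code does not work, and whether I am missing something elsewhere that Process needs in order to work. Suggestions for improving the code are also welcome. Here is decSequenceToRNA for reference: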

def decSequenceToRNA(_idxSeq, _byteSequence, _rnaSequences):
    rnaSequence = ''
    printAndLog("!", "Processing block %d (%d bytes)" % (_idxSeq, len(_byteSequence)))
    for b in _byteSequence:
        rnaSequence = rnaSequence + base10ToRNA(ord(b))
    printAndLog("+", "Block %d completed. RNA of %d nucleotides generated." % (_idxSeq, len(rnaSequence)))
    _rnaSequences.append((_idxSeq, rnaSequence))

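2 answers:

Answer 0 (score: 1)

decSequenceToRNA runs in its own process, which means it gets its own separate copy of every data structure in the main process. So when you append to _rnaSequences inside decSequenceToRNA, it has no effect on rnaSequences in the parent process. That explains why an empty list is returned.

You have two ways to fix this. The first is to create a list that can be shared between processes, using multiprocessing.Manager. For example: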

import multiprocessing

def f(shared_list):
    shared_list.append(1)

if __name__ == "__main__":
    normal_list = []
    p = multiprocessing.Process(target=f, args=(normal_list,))
    p.start()
    p.join()
    print(normal_list)

    m = multiprocessing.Manager()
    shared_list = m.list()
    p = multiprocessing.Process(target=f, args=(shared_list,))
    p.start()
    p.join()
    print(shared_list)

Output:

[]   # Normal list didn't work, the appended '1' didn't make it to the main process
[1]  # multiprocessing.Manager() list works fine

Applying this to your code only requires replacing

rnaSequences = []

with

m = multiprocessing.Manager()
rnaSequences = m.list()
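
One thing to watch with the shared list: the worker processes finish in no guaranteed order, so the tuples may not arrive in file order. Since each tuple carries its block number, the order can be restored afterwards. The helper below is a minimal sketch of that step; the name assemble_rna is made up for illustration and is not part of the original code.

```python
def assemble_rna(block_results):
    """Join (block_index, rna_string) tuples into one string, in file order.

    Hypothetical helper: accepts any iterable of tuples, for example a
    copy of the shared list made with list(rnaSequences).
    """
    # Sort on the block index so the arrival order of the workers is irrelevant.
    ordered = sorted(block_results, key=lambda item: item[0])
    return "".join(rna for _, rna in ordered)


if __name__ == "__main__":
    # Blocks deliberately listed out of order, as workers might deliver them.
    print(assemble_rna([(1, "AGGG"), (0, "AACC")]))
```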

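Alternatively, you could (and probably should) use a multiprocessing.Pool rather than creating a separate Process for every block. I'm not sure how large hFile is, or how big the blocks you are reading are, but if there are more than multiprocessing.cpu_count() blocks, you will hurt performance by spawning a process for each one. With a Pool, you keep the process count constant and can easily build the rnaSequences list: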

def decSequenceToRNA(_idxSeq, _byteSequence):
    rnaSequence = ''
    printAndLog("!", "Processing block %d (%d bytes)" % (_idxSeq, len(_byteSequence)))
    for b in _byteSequence:
        rnaSequence = rnaSequence + base10ToRNA(ord(b))
    printAndLog("+", "Block %d completed. RNA of %d nucleotides generated." % (_idxSeq, len(rnaSequence)))
    return _idxSeq, rnaSequence

def fileToRNAString(_file):
    rnaSequences = []
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        results = []
        pool = multiprocessing.Pool()  # Creates a pool of cpu_count() processes
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                # apply_async takes the positional arguments as a single tuple
                result = pool.apply_async(decSequenceToRNA, (blockCount, buf))
                results.append(result)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
        rnaSequences = [r.get() for r in results]
        pool.close()
        pool.join()
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences

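Note that we no longer pass the rnaSequences list to the child. Instead, the child simply returns the result it would have appended, which is sent back to the parent (something a bare Process does not do for you, since the return value of its target is discarded), and the list is built there.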
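
For a version that can be run without the rest of your script, here is a rough, self-contained sketch of the same Pool idea using pool.map. Several things in it are assumptions on my part: byte_to_rna is a hypothetical stand-in for base10ToRNA (a two-bits-per-letter mapping that happens to reproduce the 0x05 and 0x2A examples from the question), the function names are invented, and the 64 KB default block size is an arbitrary guess. It also builds each string with join rather than repeated concatenation, which is usually cheaper. It is written to run on both Python 2.7 and Python 3.

```python
import multiprocessing

# ASSUMPTION: two bits per letter, most significant pair first, with
# 0 -> A, 1 -> C, 2 -> G, 3 -> T. This matches 0x05 -> 'AACC' and
# 0x2A -> 'AGGG' from the question; the real base10ToRNA may differ.
NUCLEOTIDES = "ACGT"


def byte_to_rna(value):
    """Hypothetical stand-in for base10ToRNA: one byte -> four letters."""
    return "".join(NUCLEOTIDES[(value >> shift) & 0b11] for shift in (6, 4, 2, 0))


def convert_block(indexed_block):
    """Convert one (block_index, bytes) pair; runs inside a worker process."""
    index, block = indexed_block
    # bytearray yields integers on both Python 2 and Python 3.
    return index, "".join(byte_to_rna(b) for b in bytearray(block))


def read_blocks(path, block_size):
    """Yield (block_index, bytes) pairs read sequentially from the file."""
    with open(path, "rb") as handle:
        index = 0
        block = handle.read(block_size)
        while block:
            yield index, block
            index += 1
            block = handle.read(block_size)


def file_to_rna_sequences(path, block_size=64 * 1024, processes=None):
    """Return a list of (block_index, rna_string) tuples in file order."""
    pool = multiprocessing.Pool(processes)  # None means cpu_count() workers
    try:
        # map() returns results in input order, whichever worker finishes first.
        return pool.map(convert_block, read_blocks(path, block_size))
    finally:
        pool.close()
        pool.join()
```

Calling file_to_rna_sequences(path) returns the same kind of tuple list as before. With files of at most 2 MB, very small blocks mean many tiny tasks, and the cost of shipping each one to a worker may outweigh the gain, so it is worth timing a few block sizes against the plain linear version before settling on one.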

Answer 1 (score: -1)

Try writing it like this (with a comma at the end of the argument list):

p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences,))
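
For context on this suggestion: in Python a trailing comma only changes the meaning of a parenthesised expression when there is a single element, so for a three-element args tuple it makes no difference. The short check below illustrates that; the values are placeholders, not the real arguments.

```python
# Placeholder values standing in for (blockCount, buf, rnaSequences).
with_comma = (0, b"\x05", [],)
without_comma = (0, b"\x05", [])

# Both spellings build the same three-element tuple.
print(with_comma == without_comma)  # True

# The comma only matters for a single element:
print(type((0,)) is tuple)  # True, a one-element tuple
print(type((0)) is int)     # True, just a parenthesised integer
```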