Using queues for synchronized multiprocessing

Date: 2014-04-01 15:08:39

Tags: python python-2.7 multiprocessing

I have a Python program that reads a line from a file, processes that line, and then writes it to a new file. It repeats this for every line in the file. Basically:

for i in range(nlines):
    line = read_line(line_number = i)
    processed_line = process_line(line)
    write_line(processed_line)

I want to multiprocess it so that one process handles the reading and writing while another process does the processing:

read line 1 -> read line 2 -> write line 1 -> read line 3 -> write line 2 --> etc
              process line 1 --------------> process line 2 ----------------> etc

I think I need two queues to pass the data back and forth, but I don't really know how to implement that in practice. Any ideas on how to split this problem across two processes with multiprocessing?

1 Answer:

Answer 0 (score: 0)

from multiprocessing import Process, Queue as mpq

def worker(qIn, qOut):
    # pull (index, line) pairs until the None sentinel arrives
    for i, line in iter(qIn.get, None):
        qOut.put((i, process_line(line)))  # process_line is the asker's function

def main(infilepath, outfilepath):
    qIn, qOut = mpq(), mpq()
    Process(target=worker, args=(qIn, qOut)).start()
    with open(infilepath) as infile, open(outfilepath, 'w') as outfile:
        numLines = 0
        for tup in enumerate(infile):  # tup is a (line_index, line) pair
            qIn.put(tup)
            numLines += 1
        qIn.put(None)  # sentinel: tells the worker there is no more input
        retlines = {}
        top = -1  # index of the last line written out so far
        for _ in range(numLines):
            i, line = qOut.get()
            retlines[i] = line
            # flush every buffered line that is now contiguous with the output
            while top + 1 in retlines:
                top += 1
                outfile.write(retlines.pop(top))
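
For completeness, here's a minimal driver sketch for the version above; the process_line stub and the file names are placeholders of mine, not part of the original answer:

    # hypothetical stand-in for whatever processing the asker needs
    def process_line(line):
        return line.upper()

    if __name__ == '__main__':  # guard required where multiprocessing spawns new interpreters
        main('test.in', 'test.out')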

Of course, this waits until the input file has been read completely before it starts writing the output file, which is an efficiency bottleneck. I would do it like this instead:

import collections
from multiprocessing import Process, Queue

def procWorker(qIn, qOut, numWriteWorkers):
    for fpath, i, line in iter(qIn.get, None):
        qOut.put((fpath, i, process_line(line)))  # process_line is the asker's function
    # one sentinel per write worker (assumes a single process worker)
    for _ in range(numWriteWorkers):
        qOut.put(None)

def readWorker(qIn, qOut, numProcWorkers):
    for infilepath in iter(qIn.get, None):
        with open(infilepath) as infile:
            for i, line in enumerate(infile):
                qOut.put((infilepath, i, line))
    # one sentinel per process worker (assumes a single read worker)
    for _ in range(numProcWorkers):
        qOut.put(None)

def writeWorker(qIn, qOut):
    outfilepaths = {"test1.in" : "test1.out"}  # dict that maps input filepaths to corresponding output filepaths
    lines = collections.defaultdict(dict)        # fpath -> {line index: line} buffer
    inds = collections.defaultdict(lambda: -1)   # fpath -> index of the last line written
    for fpath,i,line in iter(qIn.get, None):
        if i == inds[fpath] + 1:
            inds[fpath] += 1
            with open(outfilepaths[fpath], 'a') as outfile:
                outfile.write(line)
                qOut.put((fpath, i))
        else:
            lines[fpath][i] = line
    for fpath in lines:
        with open(outfilepaths[fpath], 'a') as outfile:
            for i in sorted(lines[fpath]):
                outfile.write(lines[fpath][i])
                qOut.put((fpath, i))
    qOut.put(None)

def main(infilepaths):
    readqIn, readqOut, procqOut, writeqOut = [Queue() for _ in range(4)]
    numReadWorkers = 1  # fiddle to taste
    numWriteWorkers = 1  # fiddle to taste
    numProcWorkers = 1  # fiddle to taste

    for _ in range(numReadWorkers):
        Process(target=readWorker, args=(readqIn, readqOut, numProcWorkers)).start()
    for infilepath in infilepaths:
        readqIn.put(infilepath)
    for _ in range(numReadWorkers):
        readqIn.put(None)

    for _ in range(numProcWorkers):
        Process(target=procWorker, args=(readqOut, procqOut, numWriteWorkers)).start()

    for _ in range(numWriteWorkers):
        Process(target=writeWorker, args=(procqOut, writeqOut)).start()

    writeStops = 0
    while True:
        if writeStops == numWriteWorkers:
            break
        msg = writeqOut.get()
        if msg is None:
            writeStops += 1
        else:
            fpath, i = msg
            print("line #%d was written to file %s" %(i, fpath))

Note that this allows for multiple readers and writers. Usually that's pointless, since a hard disk has only one head. However, if you're on a distributed file system, or your files live on multiple hard drives, you can increase the number of read/write workers for efficiency. Assuming a trivial process_line function, numReadWorkers + numWriteWorkers should equal the total number of disk heads across all your drives. You can balance files across drives (à la RAID) for further optimization, but much depends on file sizes, read/write speeds, caching, and so on.

Really, though, the first speedup you should get is from fiddling with numProcWorkers, which should improve your throughput roughly linearly, up to, of course, the number of logical cores on your machine.
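
If you want a starting point for that tuning, a small sketch (my assumption, not from the original answer):

    import multiprocessing

    # one process worker per logical core is a reasonable default
    numProcWorkers = multiprocessing.cpu_count()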