I have a Python program that reads a line from a file, processes it, and writes it to a new file, repeating this for every line in the file. Basically:
for i in range(nlines):
    line = read_line(line_number=i)
    processed_line = process_line(line)
    write_line(processed_line)
I want to multiprocess it, so that one process is in charge of reading and writing, and another is in charge of processing:
read line 1 -> read line 2 -> write line 1 -> read line 3 -> write line 2 --> etc
process line 1 --------------> process line 2 ----------------> etc
I think I need to use two queues to pass the data back and forth, although I don't really know how to implement that in practice. Do you have any ideas on how to split this problem across two processes with multiprocessing?
Answer 0 (score: 0)
from multiprocessing import Process, Queue as mpq

def worker(qIn, qOut):
    # pull (index, line) tuples off qIn until the None sentinel arrives;
    # process_line is the user's per-line transformation
    for i, line in iter(qIn.get, None):
        qOut.put((i, process_line(line)))

def main(infilepath, outfilepath):
    qIn, qOut = mpq(), mpq()
    Process(target=worker, args=(qIn, qOut)).start()
    with open(infilepath) as infile, open(outfilepath, 'w') as outfile:
        numLines = 0
        for tup in enumerate(infile):
            qIn.put(tup)
            numLines += 1
        qIn.put(None)  # sentinel: no more input
        retlines = {}
        top = -1  # index of the last line written out
        for _ in range(numLines):
            i, line = qOut.get()
            retlines[i] = line
            # flush any run of lines that is now contiguous with what's written
            while top + 1 in retlines:
                top += 1
                outfile.write(retlines.pop(top))
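A minimal way to drive this version is sketched below; the file names are placeholders, and process_line here is just a stand-in for whatever transformation you actually need. The __main__ guard matters because multiprocessing re-imports the module in child processes on some platforms:

def process_line(line):
    # stand-in transformation; replace with the real per-line work
    return line.upper()

if __name__ == '__main__':
    main('test1.in', 'test1.out')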
Of course, this waits until the input file has been read completely before it starts writing the output file, which is an efficiency bottleneck. I would do it like this instead:
import collections
from multiprocessing import Process, Queue

def procWorker(qIn, qOut, numWriteWorkers):
    for fpath, i, line in iter(qIn.get, None):
        qOut.put((fpath, i, process_line(line)))
    for _ in range(numWriteWorkers):
        qOut.put(None)

def readWorker(qIn, qOut, numProcWorkers):
    for infilepath in iter(qIn.get, None):
        with open(infilepath) as infile:
            for i, line in enumerate(infile):
                qOut.put((infilepath, i, line))
    for _ in range(numProcWorkers):
        qOut.put(None)

def writeWorker(qIn, qOut):
    # dict that maps input filepaths to corresponding output filepaths
    outfilepaths = {"test1.in": "test1.out"}
    lines = collections.defaultdict(dict)        # out-of-order lines, buffered per file
    inds = collections.defaultdict(lambda: -1)   # index of last line written, per file
    for fpath, i, line in iter(qIn.get, None):
        if i == inds[fpath] + 1:
            inds[fpath] += 1
            with open(outfilepaths[fpath], 'a') as outfile:  # append: assumes a fresh output file
                outfile.write(line)
            qOut.put((fpath, i))
        else:
            lines[fpath][i] = line
    # flush whatever is still buffered, in line order
    for fpath in lines:
        with open(outfilepaths[fpath], 'a') as outfile:
            for i in sorted(lines[fpath]):
                outfile.write(lines[fpath][i])
                qOut.put((fpath, i))
    qOut.put(None)
def main(infilepaths):
    readqIn, readqOut, procqOut, writeqOut = [Queue() for _ in range(4)]
    numReadWorkers = 1   # fiddle to taste
    numWriteWorkers = 1  # fiddle to taste
    numProcWorkers = 1   # fiddle to taste
    # NB: the None-sentinel bookkeeping is only exact with one worker per
    # stage; with more, each stage must count sentinels rather than stop
    # at the first one
    for _ in range(numReadWorkers):
        Process(target=readWorker, args=(readqIn, readqOut, numProcWorkers)).start()
    for infilepath in infilepaths:
        readqIn.put(infilepath)
    for _ in range(numReadWorkers):
        readqIn.put(None)
    for _ in range(numProcWorkers):
        Process(target=procWorker, args=(readqOut, procqOut, numWriteWorkers)).start()
    for _ in range(numWriteWorkers):
        Process(target=writeWorker, args=(procqOut, writeqOut)).start()
    writeStops = 0
    while writeStops < numWriteWorkers:
        msg = writeqOut.get()
        if msg is None:
            writeStops += 1
        else:
            fpath, i = msg
            print("line #%d was written to file %s" % (i, fpath))
Note that this allows for the possibility of multiple readers and writers. Usually that is pointless, since a hard disk has only one head. However, if you are using some distributed filesystem, or your files live on multiple hard drives, then you can increase the number of read/write workers to improve efficiency. Assuming a trivial process_line function, numReadWorkers + numWriteWorkers should equal the total number of disk heads across all your drives. You can balance files across drives (à la RAID) for some further optimization, but much depends on file sizes, read/write speeds, caching, etc.
Really, the first speedup you should go after comes from fiddling with numProcWorkers, which should buy you a roughly linear efficiency gain, up to, of course, the number of logical cores on your machine.
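For instance, a common starting point (my heuristic, not something prescribed above) is to derive numProcWorkers from the core count:

import multiprocessing

# leave one core for the read/write workers; a heuristic, not a rule
numProcWorkers = max(1, multiprocessing.cpu_count() - 1)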