Question

What am I trying to do?

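A Python script that reads a large file (1-8 GB) and matches on a field named record, which can be "r" or "f". An "r" row and an "f" row can share the same ID. If the IDs are the same, I combine the "f" and "r" rows and save the result in a dictionary keyed by a string. To speed things up, I split the file into multiple smaller files and run the function on different files in different processes. At the end, I combine the dictionaries created by each process.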

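First I tried a multiprocessing Pool, but it made the processing slower. I read somewhere that I should try multiprocessing Process and Queue instead, but I am putting the dictionary (which is large) on the queue, and it looks like the Queue cannot handle the size. The script hangs.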

Any suggestions on how to improve the processing time?

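PS: I am sure this is CPU-bound, because when I did nothing but "pass" while reading the lines it took 97 seconds, and running the whole script took 600 seconds.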
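The read-only measurement described above could be reproduced with something like the following. This is a minimal sketch, assuming Python 3 and a gzip-compressed text file; `time_read_only` is a hypothetical helper name, not part of the original script.

```python
import gzip
import time

def time_read_only(path):
    """Read a gzip file line by line and do nothing with each line."""
    start = time.perf_counter()
    count = 0
    # "rt" opens the archive in text mode, so each line is a str
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            count += 1
    # Return the number of lines read and the elapsed wall-clock seconds
    return count, time.perf_counter() - start
```

Comparing this figure with the wall-clock time of the full run on the same file gives a rough estimate of how much time goes into parsing and dictionary updates rather than decompression and I/O.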

import gzip
import multiprocessing
from collections import defaultdict

def rLine(line, resultDict):
    # "r" rows keep the ID in column 2
    id = line[2]
    rOutput = line[4] + " " + line[5]  # this is a long string
    rList = rOutput.split(" ")
    resultDict[id].extend(rList)

def fLine(line, resultDict):
    # "f" rows keep the ID in column 7
    id = line[7]
    fOutput = line[4] + " " + line[5]  # this is a long string
    fList = fOutput.split(" ")
    resultDict[id].extend(fList)

def mainFunction(thisFile, output):
    resultDict = defaultdict(list)
    # "rt" opens the archive in text mode so each line is a str (Python 3)
    with gzip.open(thisFile, "rt") as f:
        for lines in f:
            line = lines.split(" ")
            record = line[0]
            if record == "f":
                fLine(line, resultDict)
            else:
                rLine(line, resultDict)
    # send this file's whole dictionary back to the parent process
    output.put(resultDict)

if __name__ == '__main__':
    finalDict = defaultdict(list)
    output = multiprocessing.Queue()
    processes = list()
    files = ["f1.log.gz", "f2.log.gz"]  # long list
    for f in files:
        processes.append(multiprocessing.Process(target=mainFunction, args=(f, output)))
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    for p in processes:
        finalDict.update(output.get())
    print(finalDict)
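One possible cause of the hang, offered as a sketch rather than a confirmed diagnosis: the multiprocessing documentation warns that a process which has put items on a Queue does not terminate until its buffered items have been flushed to the underlying pipe. Joining the workers before the parent has called get() can therefore deadlock once the dictionaries are too large for the pipe buffer. The example below reads one result per worker first and joins afterwards. The names `worker` and `collect_results`, the optional `ctx` argument, and the in-memory sample rows are illustrative assumptions. The column layout is copied from the script above.

```python
import multiprocessing
from collections import defaultdict

def worker(lines, output):
    """Build one partial dictionary and hand it to the parent via the queue."""
    partial = defaultdict(list)
    for raw in lines:
        fields = raw.split(" ")
        # Assumed layout, as in the script above: "f" rows carry the ID in
        # column 7, every other row in column 2.
        record_id = fields[7] if fields[0] == "f" else fields[2]
        partial[record_id].extend(fields[4:6])
    output.put(dict(partial))

def collect_results(chunks, ctx=None):
    """Run one worker per chunk and drain the queue before joining."""
    ctx = ctx or multiprocessing.get_context()
    output = ctx.Queue()
    processes = [ctx.Process(target=worker, args=(chunk, output))
                 for chunk in chunks]
    for p in processes:
        p.start()
    merged = defaultdict(list)
    # One get() per worker. Doing this before join() lets each child flush
    # its queued data and exit.
    for _ in processes:
        for record_id, tokens in output.get().items():
            merged[record_id].extend(tokens)
    for p in processes:
        p.join()
    return merged

if __name__ == "__main__":
    # Hypothetical rows standing in for the contents of two split files
    chunks = [["r x id1 x a b x x"], ["f x x x c d x id1"]]
    print(dict(collect_results(chunks)))
```

Merging with extend() rather than dict.update() keeps the entries from every worker when the same ID appears in more than one file, whereas update() would replace the earlier list. Results arrive in whichever order the workers finish, so the order of tokens under a shared ID is not fixed. If a worker dies before calling put(), the parent would block on get(), so a real script would likely want a timeout or error reporting.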

How do I use multiprocessing.Process with multiprocessing.Queue when the output is large?

0 answers: