Python's multiprocessing.Pool.imap is very convenient for processing large files line by line:
import multiprocessing

def process(line):
    processor = Processor('some-big.model')  # this takes time to load...
    return processor.process(line)

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    with open('lines.txt') as infile, open('processed-lines.txt', 'w') as outfile:
        for processed_line in pool.imap(process, infile):
            outfile.write(processed_line)
How can I make sure that helpers such as Processor in the example above are loaded only once? Is this possible at all without resorting to a more complicated/verbose structure involving queues?
Answer 0 (score: 0):
multiprocessing.Pool allows initializing resources in each worker via its initializer and initargs arguments. I was surprised to learn that the idea is to make use of global variables, as shown below:
import multiprocessing as mp

def init_process(model):
    global processor
    processor = Processor(model)  # this takes time to load...

def process(line):
    return processor.process(line)  # via global variable `processor` defined in `init_process`

if __name__ == '__main__':
    pool = mp.Pool(4, initializer=init_process, initargs=['some-big.model'])
    with open('lines.txt') as infile, open('processed-lines.txt', 'w') as outfile:
        for processed_line in pool.imap(process, infile):
            outfile.write(processed_line)
The concept isn't described very well in multiprocessing.Pool's documentation, so I hope this example helps someone else.