I have n files that I can analyze independently with the same Python script, analysis.py. In a wrapper script, wrapper.py, I loop over those files and launch analysis.py as a separate process with subprocess.Popen:
import shlex
import subprocess

for a_file in all_files:
    command = "python analysis.py %s" % a_file
    analysis_process = subprocess.Popen(
        shlex.split(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    analysis_process.wait()
Now I would like to use all k CPU cores of my machine to speed up the whole analysis. As long as there are files left to analyze, is there a way to keep k-1 analysis processes running at all times?
Answer 0 (score: 3):
This outlines how to use multiprocessing.Pool, which exists exactly for these kinds of tasks:
from multiprocessing import Pool, cpu_count
# ...
all_files = ["file%d" % i for i in range(5)]

def process_file(file_name):
    # process the file here
    return "finished file %s" % file_name

pool = Pool(cpu_count())
# this is a blocking call - when it returns, all files have been processed
results = pool.map(process_file, all_files)
# no more tasks can be submitted to the pool
pool.close()
# wait for all workers to finish (map already blocked, so this returns right away)
pool.join()
# ['finished file file0', 'finished file file1', ..., 'finished file file4']
print(results)
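Since the question actually launches analysis.py as an external process, one way to get the k-1 behaviour it asks for is to let each pool worker call the script through subprocess and size the pool at cpu_count() - 1. This is only a minimal sketch under those assumptions (Python 3.5+ for subprocess.run; run_analysis is a hypothetical helper name, not part of the answer above):

import shlex
import subprocess
from multiprocessing import Pool, cpu_count

def run_analysis(file_name):
    # each worker starts analysis.py as its own process and waits for it to finish
    command = "python analysis.py %s" % file_name
    completed = subprocess.run(
        shlex.split(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    return (file_name, completed.returncode)

if __name__ == "__main__":
    all_files = ["file%d" % i for i in range(5)]
    # keep k-1 workers busy, leaving one core free for the wrapper itself
    with Pool(max(1, cpu_count() - 1)) as pool:
        results = pool.map(run_analysis, all_files)
    print(results)

The pool hands a new file to a worker as soon as the previous one is done, so there are never more than k-1 analysis processes running at once.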
Adding Joel's comment, which mentions a common pitfall:

Make sure the function you pass to pool.map() only uses objects defined at module level. Python multiprocessing uses pickle to pass objects between processes, and pickle has trouble with things such as functions defined in a nested scope.
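To make that pitfall concrete, here is a small illustration (the names make_worker and nested are hypothetical, not from the original answer): a module-level function pickles fine, while a function defined inside another function cannot be pickled and therefore fails when handed to pool.map():

from multiprocessing import Pool

def top_level(x):
    # defined at module level, so it can be pickled and sent to the workers
    return x * 2

def make_worker():
    def nested(x):
        # defined in a nested scope; pickle cannot serialize this function
        return x * 2
    return nested

if __name__ == "__main__":
    with Pool(2) as pool:
        print(pool.map(top_level, [1, 2, 3]))   # works: [2, 4, 6]
        try:
            pool.map(make_worker(), [1, 2, 3])  # raises: can't pickle a local function
        except Exception as exc:
            print("pool.map with a nested function failed:", exc)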