I have n files that I can analyze independently with the same Python script, analysis.py. In a wrapper script, wrapper.py, I loop over those files and launch analysis.py as a separate process with subprocess.Popen:
import shlex
import subprocess

for a_file in all_files:
    command = "python analysis.py %s" % a_file
    analysis_process = subprocess.Popen(
        shlex.split(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    analysis_process.wait()
Now I would like to use all k CPU cores of my machine to speed up the whole analysis. As long as there are files left to analyze, is there a way to keep k-1 analysis processes running at all times?
Answer 0 (score: 3):
This outlines how to use multiprocessing.Pool, which exists exactly for these kinds of tasks:
from multiprocessing import Pool, cpu_count
# ...
all_files = ["file%d" % i for i in range(5)]

def process_file(file_name):
    # process the file here
    return "finished file %s" % file_name

pool = Pool(cpu_count())
# this is a blocking call - when it returns, all files have been processed
results = pool.map(process_file, all_files)
# no more tasks can be submitted to the pool
pool.close()
# wait for all workers to finish (map already blocked, so this returns right away)
pool.join()
# ['finished file file0', 'finished file file1', ..., 'finished file file4']
print(results)
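Since the question actually launches analysis.py as an external process, one way to get the k-1 behaviour it asks for is to let each pool worker call the script through subprocess and size the pool at cpu_count() - 1. This is only a minimal sketch under those assumptions (Python 3.5+ for subprocess.run; run_analysis is a hypothetical helper name, not part of the answer above):

import shlex
import subprocess
from multiprocessing import Pool, cpu_count

def run_analysis(file_name):
    # each worker starts analysis.py as its own process and waits for it to finish
    command = "python analysis.py %s" % file_name
    completed = subprocess.run(
        shlex.split(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    return (file_name, completed.returncode)

if __name__ == "__main__":
    all_files = ["file%d" % i for i in range(5)]
    # keep k-1 workers busy, leaving one core free for the wrapper itself
    with Pool(max(1, cpu_count() - 1)) as pool:
        results = pool.map(run_analysis, all_files)
    print(results)

The pool hands a new file to a worker as soon as the previous one is done, so there are never more than k-1 analysis processes running at once.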
Adding Joel's comment, which mentions a common pitfall:

Make sure the function you pass to pool.map() only uses objects defined at module level. Python multiprocessing uses pickle to pass objects between processes, and pickle has trouble with things such as functions defined in a nested scope.
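To make that pitfall concrete, here is a small illustration (the names make_worker and nested are hypothetical, not from the original answer): a module-level function pickles fine, while a function defined inside another function cannot be pickled and therefore fails when handed to pool.map():

from multiprocessing import Pool

def top_level(x):
    # defined at module level, so it can be pickled and sent to the workers
    return x * 2

def make_worker():
    def nested(x):
        # defined in a nested scope; pickle cannot serialize this function
        return x * 2
    return nested

if __name__ == "__main__":
    with Pool(2) as pool:
        print(pool.map(top_level, [1, 2, 3]))   # works: [2, 4, 6]
        try:
            pool.map(make_worker(), [1, 2, 3])  # raises: can't pickle a local function
        except Exception as exc:
            print("pool.map with a nested function failed:", exc)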