Passing large amounts of data with multiprocessing

Time: 2018-02-19 06:24:35

Tags: python parallel-processing multiprocessing large-data

I am trying to figure out how to write a program that performs computations in parallel so that the result of each computation can be written to a file in a specific order. My problem is size; I would like to do what I have outlined in the example program below - save the large outputs as the values of a dictionary whose keys store the ordering. But my program keeps breaking because it cannot store/pass that many bytes.

Is there a way around this? I am new to dealing with multiprocessing and large data.

from multiprocessing import Process, Manager

def eachProcess(i, d):
    # perform some computation resulting in millions of bytes
    LARGE_BINARY_OBJECT = ...
    d[i] = LARGE_BINARY_OBJECT

def main():
    manager = Manager()
    d = manager.dict()
    maxProcesses = 10
    for i in range(maxProcesses):
        process = Process(target=eachProcess, args=(i,d))
        process.start()

    # write the results to test.txt in key order as they become available
    counter = 0
    with open("test.txt", "wb") as file1:
        while counter < maxProcesses:
            if counter in d:
                file1.write(d[counter])
                counter += 1

if __name__ == '__main__':
    main()

Thanks.

1 answer:

Answer 0: (score: 1)

When dealing with large data, there are usually two approaches:

  1. The local file system, if the problem is simple enough
  2. A remote data store, if more complex data support is needed

Since your problem seems fairly simple, I would suggest the following solution: each process writes its partial result to a local file, and once all processing is done, the main process combines all the result files.

    from multiprocessing import Pool
    from tempfile import NamedTemporaryFile

    def worker_function(partial_result_path):
        # produce_large_binary() stands in for the expensive computation
        data = produce_large_binary()
        with open(partial_result_path, 'wb') as partial_result_file:
            partial_result_file.write(data)

    max_processes = 10

    # store partial results in temporary files; delete=False keeps the files
    # on disk after they are closed, so workers and the main process can
    # reopen them by path
    partial_result_paths = [NamedTemporaryFile(delete=False).name
                            for _ in range(max_processes)]

    with Pool(max_processes) as pool:
        pool.map(worker_function, partial_result_paths)

    # concatenate the partial results, in order, into the final file
    with open('test.txt', 'wb') as result_file:
        for partial_result_path in partial_result_paths:
            with open(partial_result_path, 'rb') as partial_result_file:
                result_file.write(partial_result_file.read())
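
If each partial result is itself too large to read into memory in one go, the final concatenation step can stream the files in chunks instead of calling read() on each whole file. A minimal sketch of that variation (not part of the original answer), using shutil.copyfileobj:

    import shutil

    # copy each partial result into the final file in fixed-size chunks,
    # so the main process never holds an entire partial result in memory
    with open('test.txt', 'wb') as result_file:
        for partial_result_path in partial_result_paths:
            with open(partial_result_path, 'rb') as partial_result_file:
                shutil.copyfileobj(partial_result_file, result_file)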