Question

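I have a problem with parallel computation on data from a large CSV file. The file itself cannot be read in parallel, but chunks of data read from it can be handed off for parallel computation. I tried multiprocessing.Pool without success (it looks as if Pool.imap does not accept a generator that uses yield).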

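I have a generator that reads chunks of data from the file. Fetching one chunk takes about 3 seconds, and processing it takes about 2 seconds. I read 50 chunks from the file in total. While waiting for the next chunk, I could compute the previous chunk "in parallel".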

Here is some code that shows the concept (it does not work in practice):

def file_data_generator(path):
    # file reading chunk by chunk 
    yield datachunk

def compute(datachunk):
    # some heavy computation 2.sec
    return partial_result

from multiprocessing import Pool
p = Pool()
result = p.imap(compute, file_data_generator(path) ) # yield is the issue?

What am I doing wrong? Should I use a different tool? This is Python 3.5.

A simple code concept/skeleton would be appreciated :)

Answer 1

You are very close. The generator part is correct: imap takes an iterable as its argument and calls next() on it, so yield is fine in this context.
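To check this in isolation, here is a minimal sketch. The names number_generator, square, and squares_via_imap are hypothetical stand-ins for illustration and do not come from the question.

```python
from multiprocessing import Pool


def number_generator(n):
    # Hypothetical stand-in for the file-reading generator.
    for i in range(n):
        yield i


def square(x):
    # Hypothetical stand-in for the heavy computation.
    return x * x


def squares_via_imap(n):
    # imap accepts any iterable, including a generator, and
    # yields results in the same order as the inputs.
    with Pool(2) as pool:
        return list(pool.imap(square, number_generator(n)))


if __name__ == "__main__":
    print(squares_via_imap(5))  # prints [0, 1, 4, 9, 16]
```

The results are collected into a list inside the with block, because leaving the block terminates the pool.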

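What you are missing is that imap is non-blocking: the result = p.imap(...) call returns immediately, even though the worker processes have not finished yet. You can either call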

p.close()
p.join()

and then work with results as a whole, or simply iterate over the results. Here is a working example:

from multiprocessing import Pool

def compute(line):
    # some heavy computation (about 2 seconds in the question)
    return len(line)

def file_data_generator(path):
    # read the file piece by piece (here: line by line)
    with open(path) as f:
        for line in f:
            yield line.strip()

if __name__ == '__main__':
    p = Pool()
    # imap returns immediately; results is a lazy iterator that
    # yields each value as soon as it is ready, in input order
    results = p.imap(compute, file_data_generator('book.txt'))

    # tell the pool that no more tasks will be submitted
    p.close()
    for res in results:
        print(res)
    # wait for the worker processes to exit
    p.join()
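
The question reads multi-line chunks rather than single lines. A minimal sketch of the same imap pattern for that case follows, assuming chunks of a fixed number of lines. The names chunk_generator, process_file, and lines_per_chunk, and the character-counting compute, are illustrative placeholders and not the asker's actual code.

```python
from itertools import islice
from multiprocessing import Pool


def chunk_generator(path, lines_per_chunk=1000):
    # Read the file sequentially and yield lists of lines.
    # lines_per_chunk is an assumed, tunable chunk size.
    with open(path) as f:
        while True:
            chunk = list(islice(f, lines_per_chunk))
            if not chunk:
                return
            yield chunk


def compute(chunk):
    # Placeholder for the heavy computation: count the characters
    # in the chunk, ignoring line endings.
    return sum(len(line.rstrip("\n")) for line in chunk)


def process_file(path, lines_per_chunk=1000, workers=None):
    # The file is read in the parent process while the workers
    # compute earlier chunks, so reading and computing overlap.
    # Partial results arrive in chunk order and are summed here.
    with Pool(workers) as pool:
        partials = pool.imap(compute, chunk_generator(path, lines_per_chunk))
        return sum(partials)


if __name__ == "__main__":
    print(process_file("book.txt"))
```

If the order of the partial results does not matter, imap_unordered can be used in place of imap so that a slow chunk does not hold back results that are already finished.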

Parallel processing of data from a file
