Map-reduce with multiprocessing

Date: 2016-07-13 21:43:06

Tags: python mapreduce

import multiprocessing

data = range(10)

def map_func(i):
    return [i]

def reduce_func(a, b):
    return a+b

p = multiprocessing.Pool(processes=4)
p.map(map_func, data)

How can I use reduce_func() as the reduce step for the parallelized map_func()?

Here is a PySpark example of what I am trying to do:

rdd = sc.parallelize(data)
result = rdd.map(map_func)
final_result = result.reduce(reduce_func)

1 answer:

Answer 0 (score: 1)

According to the documentation, multiprocessing.Pool.map() blocks until the result is ready and returns the results in input order, so an arbitrary ordering is not possible with it. To consume results in whatever order the workers finish, use the imap_unordered() method instead:

from functools import reduce

result = p.imap_unordered(map_func, data)
final_result = reduce(reduce_func, result)

# Three different runs:
# [0, 1, 4, 5, 2, 6, 8, 9, 7, 3]
# [0, 1, 4, 5, 2, 3, 8, 7, 6, 9]
# [0, 1, 2, 5, 6, 7, 8, 4, 3, 9]