import multiprocessing
data = range(10)
def map_func(i):
    return [i]
def reduce_func(a, b):
    return a + b
p = multiprocessing.Pool(processes=4)
p.map(map_func, data)
How can I use reduce_func() as the reduce function for the parallel map_func()?
Here is a pySpark example of what I want to do:
rdd = sc.parallelize(data)
result = rdd.map(map_func)
final_result = result.reduce(reduce_func)
Answer 0: (score: 1)
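According to the documentation, multiprocessing.Pool.map() blocks until the result is ready and returns the results in input order, so a random order is not possible with it. To get the results in whatever order the workers finish them, use the imap_unordered() method: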
from functools import reduce
result = p.imap_unordered(map_func, data)
final_result = reduce(reduce_func, result)
# Three different runs:
# [0, 1, 4, 5, 2, 6, 8, 9, 7, 3]
# [0, 1, 4, 5, 2, 3, 8, 7, 6, 9]
# [0, 1, 2, 5, 6, 7, 8, 4, 3, 9]
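For comparison, here is a minimal self-contained sketch of both variants, assuming the same map_func and reduce_func as in the question. The helper name map_reduce and the `if __name__ == "__main__":` guard are additions for illustration and are not part of the original answer. Note that in both variants only the map step runs in the worker processes; functools.reduce still combines the results sequentially in the parent process.

```python
from functools import reduce
from multiprocessing import Pool


def map_func(i):
    return [i]


def reduce_func(a, b):
    return a + b


def map_reduce(pool, data):
    # Hypothetical helper: map in the pool, reduce in the parent process.
    # imap_unordered yields each mapped result as soon as a worker finishes it.
    mapped = pool.imap_unordered(map_func, data)
    return reduce(reduce_func, mapped)


if __name__ == "__main__":
    data = range(10)
    with Pool(processes=4) as pool:
        # pool.map keeps the input order, so this result is deterministic.
        ordered = reduce(reduce_func, pool.map(map_func, data))
        # The order here can differ from run to run.
        unordered = map_reduce(pool, data)
    print(ordered)
    print(unordered)
```

If the order of the final result matters, the pool.map() variant is the safer choice. imap_unordered() is only appropriate when reduce_func gives an acceptable result regardless of the order in which the pieces arrive.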