import multiprocessing as mp
import numpy as np
pool = mp.Pool( processes = 4 )
inp = np.linspace( 0.01, 1.99, 100 )
result = pool.map_async( func, inp ) #Line1 ( func is some Python function which acts on input )
output = result.get() #Line2
So, I am trying to parallelise some code in Python, using the .map_async() method on a multiprocessing.Pool() instance.
I noticed that
Line1
takes roughly a thousandth of a second, whereas
Line2
takes about 0.3 seconds.
Is there a better way to avoid the bottleneck caused by Line2,
or
am I doing something wrong here?
( I am rather new to this. )
Answer 0 (score: 0)
Am I doing something wrong here?
This is a common lesson: it is not about using some "promising" syntax constructor, but about paying the actual costs of using it.
The story is long, the effect is simple - you expected a low-hanging fruit, but had to pay the immense costs of process instantiation, of work-package distribution and of the collection of results, all that circus just in order to do but a few rounds of func() calls.
Well, who told you that any such ( potential ) speedup is for free?
Let's quantify, and measure the actual code-execution times, instead of emotions, right?
Benchmarking is always a fair move. It helps us, mortals, escape from mere expectations and get ourselves into quantitative, records-of-evidence supported knowledge:
from zmq import Stopwatch; aClk = Stopwatch() # this is a handy tool to do so
Before moving any further, one ought to record this pair:
>>> aClk.start(); _ = [ func( SEQi ) for SEQi in inp ]; aClk.stop() # [SEQ]
>>> HowMuchWillWePAY2RUN( func, 4, 100 ) # [RUN]
>>> HowMuchWillWePAY2MAP( func, 4, 100 ) # [MAP]
This sets, if one wishes to extend the experiment, the span of the performance envelope - from a pure-[SERIAL] [SEQ] of calls, to an un-optimised joblib.Parallel(), or to any other tool, such as the said multiprocessing.Pool() or others.
Intent:
So as to measure the cost of { process | job }-instantiation, we need a NOP-work-package payload, that spends almost nothing "there", yet returns "back", and does not require paying any additional add-on costs (be it for any input-parameter transmission or for returning any value):
def a_NOP_FUN( aNeverConsumedPAR ):
    """ __doc__
    The intent of this FUN() is indeed to do nothing at all,
    so as to be able to benchmark
    all the process-instantiation
    add-on overhead costs.
    """
    pass
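For readers without pyzmq installed, the pure-[SEQ] baseline above can be recorded with the standard library alone; a minimal sketch, assuming time.perf_counter_ns() in place of zmq.Stopwatch():

```python
import time

def a_NOP_FUN( aNeverConsumedPAR ):
    """Do nothing at all, so only the bare call overhead is measured."""
    pass

t0 = time.perf_counter_ns()
_ = [ a_NOP_FUN( SEQi ) for SEQi in range( 100 ) ]   # the pure-[SEQ] baseline
elapsed_us = ( time.perf_counter_ns() - t0 ) // 1000

print( "[SEQ]-pure-[SERIAL] took ~ {0} [us] on this localhost".format( elapsed_us ) )
```

The absolute number is platform-dependent; what matters is comparing it against the [MAP] and [RUN] figures below.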
So, the setup-overhead add-on costs comparison goes here:
#-------------------------------------------------------<function a_NOP_FUN
[SEQ]-pure-[SERIAL]     worked within ~   37 ..     44 [us] on this localhost
[MAP]-just-[CONCURRENT] took            2536 ..   7343 [us]
[RUN]-just-[CONCURRENT] took          111162 .. 112609 [us]
A strategy of using joblib.delayed() task-processing inside a joblib.Parallel() pool:

def HowMuchWillWePAY2RUN( aFun2TEST = a_NOP_FUN, JOBS_TO_SPAWN = 4, RUNS_TO_RUN = 10 ):
    from zmq import Stopwatch; aClk = Stopwatch()
    try:
        import joblib
        aClk.start()
        joblib.Parallel( n_jobs = JOBS_TO_SPAWN
                         )( joblib.delayed( aFun2TEST )
                                          ( aFunPARAM )
                            for ( aFunPARAM )
                            in      range( RUNS_TO_RUN )
                            )
    except:
        pass
    finally:
        try:
            _ = aClk.stop()
        except:
            _ = -1
    pMASK = "CLK:: {0:_>24d} [us] @{1: >4d}-JOBs ran{2: >6d} RUNS {3:}"
    print( pMASK.format( _,
                         JOBS_TO_SPAWN,
                         RUNS_TO_RUN,
                         " ".join( repr( aFun2TEST ).split( " " )[:2] )
                         )
           )
A strategy of using the lightweight .map_async() method on a multiprocessing.Pool() instance:

def HowMuchWillWePAY2MAP( aFun2TEST = a_NOP_FUN, PROCESSES_TO_SPAWN = 4, RUNS_TO_RUN = 1 ):
    from zmq import Stopwatch; aClk = Stopwatch()
    try:
        import numpy           as np
        import multiprocessing as mp
        pool = mp.Pool( processes = PROCESSES_TO_SPAWN )
        inp  = np.linspace( 0.01, 1.99, 100 )
        aClk.start()
        for i in range( RUNS_TO_RUN ):
            result = pool.map_async( aFun2TEST, inp )
            output = result.get()
    except:
        pass
    finally:
        try:
            _ = aClk.stop()
        except:
            _ = -1
    pMASK = "CLK:: {0:_>24d} [us] @{1: >4d}-PROCs ran{2: >6d} RUNS {3:}"
    print( pMASK.format( _,
                         PROCESSES_TO_SPAWN,
                         RUNS_TO_RUN,
                         " ".join( repr( aFun2TEST ).split( " " )[:2] )
                         )
           )
So,
the first set of pains and surprises
comes right at the actual cost of doing nothing at all in a concurrent pool of joblib.Parallel():
CLK:: __________________117463 [us] @ 4-JOBs ran 10 RUNS <function a_NOP_FUN
CLK:: __________________111182 [us] @ 3-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: __________________110229 [us] @ 3-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: __________________110095 [us] @ 3-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: __________________111794 [us] @ 3-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: __________________110030 [us] @ 3-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: __________________110697 [us] @ 3-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: _________________4605843 [us] @ 123-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: __________________336208 [us] @ 123-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: __________________298816 [us] @ 123-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: __________________355492 [us] @ 123-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: __________________320837 [us] @ 123-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: __________________308365 [us] @ 123-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: __________________372762 [us] @ 123-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: __________________304228 [us] @ 123-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: __________________337537 [us] @ 123-JOBs ran 100 RUNS <function a_NOP_FUN
CLK:: __________________941775 [us] @ 123-JOBs ran 10000 RUNS <function a_NOP_FUN
CLK:: __________________987440 [us] @ 123-JOBs ran 10000 RUNS <function a_NOP_FUN
CLK:: _________________1080024 [us] @ 123-JOBs ran 10000 RUNS <function a_NOP_FUN
CLK:: _________________1108432 [us] @ 123-JOBs ran 10000 RUNS <function a_NOP_FUN
CLK:: _________________7525874 [us] @ 123-JOBs ran100000 RUNS <function a_NOP_FUN
So, this scientifically fair and rigorous test has, starting from this simplest-ever case, already shown the benchmarked costs of all the associated code-execution setup overheads of the smallest-ever joblib.Parallel() penalty sine-qua-non.
This pushes us in the direction where real-world algorithms do live - best with next adding some bigger-and-bigger "payloads" into the testing loop: what is the penalty of the [CONCURRENT] code-execution then? Using this systematic and lightweight approach, we may move forward in the story, since we will also need to benchmark the add-on costs and the other Amdahl's-Law effects of { remote-job-PAR-XFER(s) | remote-job-MEM.alloc(s) | remote-job-CPU-bound-processing | remote-job-fileIO(s) }.
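To make the Amdahl's-Law argument concrete, here is a minimal, overhead-aware sketch (my own illustrative formula, not from the post: the parallelisable fraction p of the serial time is divided across n workers, while the measured setup overhead is paid on top). Plugging in the question's numbers - roughly 4 [ms] of useful work against ~0.3 [s] of pool overhead - yields a "speedup" well below 1, i.e. a slowdown:

```python
def overhead_aware_speedup( T_serial, p, n, setup_overhead ):
    """Amdahl-style speedup with an explicit setup-overhead term.
    T_serial       : total serial run time [s]
    p              : parallelisable fraction of T_serial (0..1)
    n              : number of workers
    setup_overhead : fixed cost of spawning / distributing / collecting [s]
    """
    T_parallel = ( 1 - p ) * T_serial + ( p * T_serial ) / n + setup_overhead
    return T_serial / T_parallel

# ~100 calls of a ~40 [us] func: about 4 [ms] of useful work,
# against an observed ~0.3 [s] of pool setup + collection overhead
print( overhead_aware_speedup( T_serial = 0.004, p = 1.0, n = 4, setup_overhead = 0.3 ) )

# the same overhead amortises once the useful work is large enough
print( overhead_aware_speedup( T_serial = 100.0, p = 0.9, n = 4, setup_overhead = 0.3 ) )
```

The first case pays more in overhead than the whole job is worth; the second shows the same fixed cost becoming negligible once each work-package carries real computation.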
A function template like this may help in re-testing (as you see, there will be a lot of re-runs, while O/S noise and some other artifacts will step into the actual patterns of the costs of use):
Once we have paid the up-front costs, the next most common mistake is to forget about the costs of memory allocations. So, let's test these:
def a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR( aNeverConsumedPAR, SIZE1D = 1000 ):
    """ __doc__
    The intent of this FUN() is to do nothing but
    a MEM-allocation
    so as to be able to benchmark
    all the process-instantiation
    add-on overhead costs.
    """
    import numpy as np                  # yes, deferred import, libs do defer imports
    aMemALLOC = np.zeros( ( SIZE1D,     # so as to set
                            SIZE1D,     # realistic ceilings
                            SIZE1D,     # as how big the "Big Data"
                            SIZE1D      # may indeed grow into
                            ),
                          dtype = np.float64,
                          order = 'F'
                          )             # .ALLOC + .SET
    aMemALLOC[2,3,4,5] = 8.7654321      # .SET
    aMemALLOC[3,3,4,5] = 1.2345678      # .SET
    return aMemALLOC[2:3,3,4,5]
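Before going parallel, one may first time a single, scaled-down allocation serially; a minimal sketch, assuming a stdlib timer instead of zmq.Stopwatch() and a deliberately modest SIZE1D (32**4 float64 cells is ~8 MB, unlike the ~8 TB the default SIZE1D = 1000 would request):

```python
import time
import numpy as np

def alloc_cost_us( SIZE1D = 32 ):
    """Time one 4-D float64 .ALLOC + .SET, a scaled-down stand-in
    for the payload of a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR."""
    t0 = time.perf_counter_ns()
    aMemALLOC = np.zeros( ( SIZE1D, SIZE1D, SIZE1D, SIZE1D ),
                          dtype = np.float64,
                          order = 'F'
                          )
    aMemALLOC[2,3,4,5] = 8.7654321
    return ( time.perf_counter_ns() - t0 ) // 1000

print( "one .ALLOC + .SET took ~ {0} [us] on this localhost".format( alloc_cost_us() ) )
```

Scaling SIZE1D up (serially, first) shows how fast the per-call allocation cost grows before any process-spawning overhead is even added on top.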
In case your platform stops being able to allocate the requested memory blocks, we run into another kind of problem (with a class of hidden glass ceilings, if trying to go parallel in a physical-resources agnostic manner). One can edit the SIZE1D scaling, so as to at least fit into the platform's RAM addressing / sizing capabilities, yet, the performance envelope of a real-world problem computation is still of our keen interest here:
>>> HowMuchWillWePAY2RUN( a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR, 200, 1000 )
may yield
a cost-to-pay of anything between 0.1 [s]
and +9 [s]
(!!)
just for still doing nothing, but now also without forgetting about some actual MEM-allocation add-on costs "there":
CLK:: __________________116310 [us] @ 4-JOBs ran 10 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________120054 [us] @ 4-JOBs ran 10 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________129441 [us] @ 10-JOBs ran 100 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________123721 [us] @ 10-JOBs ran 100 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________127126 [us] @ 10-JOBs ran 100 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________124028 [us] @ 10-JOBs ran 100 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________305234 [us] @ 100-JOBs ran 100 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________243386 [us] @ 100-JOBs ran 100 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________241410 [us] @ 100-JOBs ran 100 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________267275 [us] @ 100-JOBs ran 100 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________244207 [us] @ 100-JOBs ran 100 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________653879 [us] @ 100-JOBs ran 1000 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________405149 [us] @ 100-JOBs ran 1000 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________351182 [us] @ 100-JOBs ran 1000 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________362030 [us] @ 100-JOBs ran 1000 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: _________________9325428 [us] @ 200-JOBs ran 1000 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________680429 [us] @ 200-JOBs ran 1000 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________533559 [us] @ 200-JOBs ran 1000 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: _________________1125190 [us] @ 200-JOBs ran 1000 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
CLK:: __________________591109 [us] @ 200-JOBs ran 1000 RUNS <function a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR
Kindly read the tail sections of this post
For each and every such "promise", the best next step is to first cross-validate the actual code-execution costs, before starting any code re-design. The sum of the actual platform's add-on costs may devastate any expected speedup, even though the original, overhead-naive Amdahl's Law might have promised some expected speedup effects.
As Mr. Walter E. Deming has expressed many times: without DATA we make ourselves left to just OPINIONS.
Bonus section:
Having read up to here, one might have already found, that there is not any kind of "drawback" or "error" in the #Line2 per se, but a careful design practice will show any better syntax constructor, that spends less to achieve more (as the actual resources (CPU, MEM, IO, O/S) of the code-execution platform permit). Anything else would not principally differ from blindly telling fortunes.