Maybe it is simple, but I have a slight doubt about it.
The challenge I face is to execute a parallel child function from a mother function. While waiting for the results of the parallel child-function calls, the mother function should run only once.
I wrote a small example that illustrates my dilemma.
import random
import string
from joblib import Parallel, delayed
import multiprocessing

def jobToDoById(id):
    # do some other logic based on the ID given
    rand_str = ''.join(random.choice(string.ascii_lowercase + string.ascii_uppercase + string.digits) for i in range(10))
    return [id, rand_str]

def childFunctionParallel(jobs):
    num_cores = multiprocessing.cpu_count()
    num_cores = num_cores - 1
    if __name__ == '__main__':
        p = Parallel(n_jobs=num_cores)(delayed(jobToDoById)(i) for i in jobs)
        return p

def childFunctionSerial(jobs):
    result = []
    for job in jobs:
        job_result = jobToDoById(job)
        result.append(job_result)
    return result

def motherFunction(countries_cities, doInParallel):
    result = []
    print("Start mainLogic")
    for country in countries_cities:
        city_list = countries_cities[country]
        if doInParallel:
            cities_result = childFunctionParallel(city_list)
        else:
            cities_result = childFunctionSerial(city_list)
        result.append(cities_result)
        # ..... do some more logic
    # ..... do some more logic before returning
    print("End mainLogic")
    return result

print("Start Program")
countries_cities = {
    "United States": ["Alabama", "Hawaii", "Mississippi", "Pennsylvania"],
    "United Kingdom": ["Cambridge", "Coventry", "Gloucester", "Nottingham"],
    "France": ["Marseille", "Paris", "Saint-Denis", "Nanterre", "Aubervilliers"],
    "Denmark": ["Aarhus", "Slagelse", "Nykøbing F", "Rønne", "Odense"],
    "Australia": ["Sydney", "Townsville", "Bendigo", "Bathurst", "Busselton"],
}
result_mother = motherFunction(countries_cities, doInParallel=True)  # should be executed only once
print(result_mother)
print("End Program")
If you switch doInParallel between True and False, you can see the problem. Running with childFunctionSerial(), motherFunction() is executed only once. But when we run with childFunctionParallel, motherFunction() gets executed multiple times. Both give the same result, but the problem I have is that motherFunction() must be executed only once.
Two questions:
1. How can I restructure the program so that we execute the mother function once, start the parallel jobs from within it, and avoid running multiple instances of that same mother function?
2. How can I pass a second parameter to jobToDoById() besides id?
Answer 0 (score: 1)
( id, .., )
This one is simple and commonly used, so one can meet it in many examples.
def jobToDoById( aTupleOfPARAMs = ( -1, ) ):   # jobToDoById(id):
    #                                          # do some other logic based on the ID given
    if not type( aTupleOfPARAMs ) is tuple:    # FUSE PROTECTION
        return [-1, "call interface violated"]
    if aTupleOfPARAMs[0] == -1:                # FUSE PROTECTION
        return [-1, None]
    # .........................................# GO GET PROCESSED:
    rand_str = ''.join( random.choice( string.ascii_lowercase
                                     + string.ascii_uppercase
                                     + string.digits
                                       )
                        for i in range( 10 )
                        )
    return [aTupleOfPARAMs[0], rand_str]       # id now lives inside the tuple
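A sketch of calling a job through this tuple-packing interface with joblib could look like the following (an illustration under the names used above; note also that joblib's delayed() accepts multiple positional arguments directly, i.e. delayed(jobToDoById)(i, second_arg) works too, so the tuple is a convention, not a requirement):

```python
import random
import string
from joblib import Parallel, delayed

def jobToDoById(aTupleOfPARAMs=(-1,)):
    # unpack the id plus one extra parameter from the tuple
    if not isinstance(aTupleOfPARAMs, tuple):          # FUSE PROTECTION
        return [-1, "call interface violated"]
    job_id, suffix = aTupleOfPARAMs
    rand_str = ''.join(random.choice(string.ascii_letters + string.digits)
                       for _ in range(10))
    return [job_id, rand_str + suffix]

if __name__ == '__main__':
    jobs = ["Paris", "Oslo", "Lisbon"]
    # pack ( id, second_parameter ) tuples, one per dispatched call
    results = Parallel(n_jobs=2)(delayed(jobToDoById)((j, "_done"))
                                 for j in jobs)
    print(results)
```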
The first question is a bit harder, but it is the more interesting one, as the principal differences among [SERIAL], "just"-[CONCURRENT] and true-[PARALLEL] system-scheduling policies of more than one process are not always respected in popular media (and sometimes not even in academia).
Your code explicitly mentions the joblib.Parallel and multiprocessing modules, but the documentation says:

By default Parallel uses the Python multiprocessing module to fork separate Python worker processes to execute tasks concurrently on separate CPUs. This is a reasonable default for generic Python programs, but it induces some overhead, as the input and output data need to be serialized in a queue for communication with the worker processes.
There are two implications - your processing will pay double, [TIME]-domain and [SPACE]-domain overhead costs, which may easily become unacceptably huge OVERHEAD COSTS (and the better if one has also noticed the words "data" and "serialized" in the citation above) - for details see the re-formulated Amdahl's Law, as detailed in the section Criticism et al. of the parallelism-amdahl discussion:
1) the whole Python interpreter, including its data and internal state, is fully forked (so you get as many copies as requested, each running but one process-flow; this is done so as not to lose performance on GIL-round-robin fragmentation / the only-1-runs, all-others-have-to-wait GIL-blocked stepping that appears among any 1+ processing-flows if done in thread-based pools and the like)
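This forking / re-instantiation of the interpreter is also the very reason the asker's motherFunction() ran multiple times: each worker process re-imports the main module, so any unguarded module-level code is re-executed (most visibly under the spawn start-method on Windows). A minimal restructuring sketch, assuming the names from the question, keeps the __main__ guard at module top-level instead of inside the child function:

```python
import random
import string
import multiprocessing
from joblib import Parallel, delayed

def jobToDoById(id):
    # do some logic based on the ID given
    rand_str = ''.join(random.choice(string.ascii_letters + string.digits)
                       for _ in range(10))
    return [id, rand_str]

def childFunctionParallel(jobs):
    num_cores = max(1, multiprocessing.cpu_count() - 1)
    # no __main__ guard here: workers import this module and only need
    # to see the function definitions, not re-run the main script
    return Parallel(n_jobs=num_cores)(delayed(jobToDoById)(i) for i in jobs)

def motherFunction(countries_cities):
    result = []
    for country, city_list in countries_cities.items():
        result.append(childFunctionParallel(city_list))
    return result

if __name__ == '__main__':
    # guarded at module level: executed once, never re-run by the workers
    countries_cities = {"Denmark": ["Aarhus", "Odense"]}
    print(motherFunction(countries_cities))
```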
2) besides all the full Python-interpreter + state re-instantiations mentioned above, also ALL the <data-IN> + <data-OUT> are:
----------------------------MAIN-starts-to-escape-from-pure-[SERIAL]-processing--
0: MAIN forks self
[1]
[2]
...
[n_jobs] - as many copies of self as requested
-------------------------MAIN-can-continue-in-"just"-[CONCURRENT]-after:
1st-Data-IN-SERialised-in-MAIN's-"__main__"
+ 2nd-Data-IN-QUEued in MAIN
+ 3rd-Data-IN-DEQueued [ith_job]s
+ 4th-Data-IN-DESerialised [ith_job]s
+ ( ...process operated the useful [ith_job]s -<The PAYLOAD>-planned... )
+ 5th-Data-OUT-SERialised [ith_job]s
+ 6th-Data-OUT-QUEued [ith_job]s
+ 7th-Data-OUT-DEQueued in-MAIN
+ 8th-Data-OUT-DESerialised-in-MAIN's-"__main__"
-------------------------------MAIN-can-continue-in-pure-[SERIAL]-processing-----
which always costs a non-negligible overhead time (for equations and details, kindly refer to the overhead-strict re-formulation of the net speedups achievable upon these add-on overhead costs, best before diving into a refactoring, in which your machine will pay way more than it gets back from attempts to ignore these principal and benchmarkable overhead costs)
For benchmarking these overhead costs, each separately, in microsecond measurements, tools are available (yet not all StackOverflow members feel happy about quantitatively robust benchmarking of these); just check other posts on parallelism-amdahl here on StackOverflow.
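A rough illustration of such a per-step measurement, here timing only the SER / DES legs of the pipeline above with the stdlib pickle module (a sketch; robust benchmarks should repeat many runs and take minima):

```python
import pickle
import time

payload = list(range(1_000_000))      # a stand-in for one job's <data-IN>

t0 = time.perf_counter()
blob = pickle.dumps(payload)          # the SERialise step, paid per dispatched job
t1 = time.perf_counter()
restored = pickle.loads(blob)         # the DESerialise step, paid on the other side
t2 = time.perf_counter()

print(f"SER: {(t1 - t0) * 1e6:9.1f} [us]")
print(f"DES: {(t2 - t1) * 1e6:9.1f} [us]")
```

Comparing these microsecond figures against the useful payload's own compute time shows immediately whether a given job is worth dispatching at all.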
The second principal limitation of the joblib.Parallel implementation, that hits, if it does not sum into, Amdahl's Law, is a resources-real-availability-agnostic optimism, whereas resources-state-aware scheduling is what happens on every real-world system.
Any degree of highly parallel code-execution may be wished for, but unless complex measures are taken in an end-to-end (top-to-bottom) system-wide coverage, all the processing goes into but a "just"-[CONCURRENT] schedule (i.e. if resources permit). This aspect is way beyond the footprint of this post, and, just naively put into the scheme above, it shows that if CPU-cores (and principally any other resource-class) are not available, the concurrency will never reach the levels of speedup that a resources-availability-agnostic original Amdahl's Law was promising.
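The overhead-strict re-formulation referred to above can be sketched as a small helper. This is an illustrative model, not a measurement: p and N are the classical parallelisable fraction and worker count, while oSetup / oTerm are hypothetical names for the add-on setup/termination overheads expressed as fractions of the original serial runtime:

```python
def amdahl_overhead_strict(p, N, oSetup=0.0, oTerm=0.0):
    """Speedup of a workload whose parallelisable fraction is p,
    run on N workers, with add-on setup/termination overheads
    expressed as fractions of the original serial runtime."""
    return 1.0 / ((1.0 - p) + p / N + oSetup + oTerm)

# with zero overheads this reproduces the classical Amdahl bound
print(amdahl_overhead_strict(0.95, 8))               # ~= 5.93x
# add-on overheads quickly eat the promised gain
print(amdahl_overhead_strict(0.95, 8, 0.10, 0.05))   # ~= 3.14x
```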
----------------------------MAIN-starts-escape-from-processing---in-pure-[SERIAL]
0: MAIN forks self -in-pure-[SERIAL]
[1] -in-pure-[SERIAL]
[2] -in-pure-[SERIAL]
... -in-pure-[SERIAL]
[n_jobs] as many copies of self-in-pure-[SERIAL]
as requested -in-pure-[SERIAL]
--------------------------MAIN-can-continue-in-"just"-[CONCURRENT]after[SERIAL]
+ 1st-Data-IN-SERialised-in-MAIN's-"__main__" , job(2), .., job(n_jobs):[SERIAL]
+ 2nd-Data-IN-QUEued in MAIN for all job(1), job(2), .., job(n_jobs):[SERIAL]
+ 3rd-Data-IN-DEQueued [ith_job]s: "just"-[CONCURRENT]||X||X||
+ 4th-Data-IN-DESerialised [ith_job]s: "just"-[CONCURRENT]|X||X|||
+ ( ...process operated the useful [ith_job]s-<The PAYLOAD>-planned... )||X|||X|
+ 5th-Data-OUT-SERialised [ith_job]s: "just"-[CONCURRENT]||||X|||
+ 6th-Data-OUT-QUEued [ith_job]s: "just"-[CONCURRENT]|X|X|X||
+ 7th-Data-OUT-DEQueued in-MAIN <-- job(1), job(2), .., job(n_jobs):[SERIAL]
+ 8th-Data-OUT-DESerialised-in-MAIN's-"__main__" job(2), .., job(n_jobs):[SERIAL]
-------------------------------MAIN-can-continue-processing------in-pure-[SERIAL]
... -in-pure-[SERIAL]