Question: I have started running some massive data-processing jobs on a machine in a DAG (directed acyclic graph) structure. Some processes can only start once their parent's data processing is complete, since there are multiple levels of processing. As a first goal I want to handle all of it on a single machine with the Python multiprocessing library, and then scale out to different machines using Managers. I have no prior experience with Python multiprocessing. Can anyone suggest whether it is a good library to start with? If so, some basic implementation ideas would be great. If not, what else could be used to do this in Python?
Example:
A -> B
B -> D, E, F, G
C -> D
In the example above, I want to kick off A & C first (in parallel). After they execute successfully, the other remaining processes will wait for B to finish first. Once B has finished executing, all the other processes start.
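To make the dependencies concrete, this is roughly how I picture them in Python (the letters are just the placeholders from the example above):

# each key is a step, each value is the set of parents it has to wait for
dependencies = {
    'A': set(),
    'C': set(),
    'B': {'A'},
    'D': {'B', 'C'},
    'E': {'B'},
    'F': {'B'},
    'G': {'B'},
}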
P.S.: Sorry, I can't share the actual data due to confidentiality, though I have tried to make it clear with this example.
Answer 0 (score: 1)
I really like using processes and queues for this sort of thing.
Like this:
from multiprocessing import Process, Queue
from queue import Empty as QueueEmpty  # on Python 2: from Queue import Empty as QueueEmpty
import time

# example process functions
def processA(queueA, queueB):
    # read from queueA, process, and pass the result on to queueB
    while True:
        try:
            data = queueA.get_nowait()
            if data == 'END':
                break
        except QueueEmpty:
            time.sleep(2)  # wait some time for data to enter queue
            continue
        # do stuff with data
        queueB.put(data)

def processB(queueB, _):
    # read from queueB; last stage, so nothing is passed downstream
    while True:
        try:
            data = queueB.get_nowait()
            if data == 'END':
                break
        except QueueEmpty:
            time.sleep(2)  # wait some time for data to enter queue
            continue
        # do stuff with data
# helper functions for starting and stopping processes
def start_procs(num_workers, target_function, args):
    procs = []
    for _ in range(num_workers):
        p = Process(target=target_function, args=args)
        p.start()
        procs.append(p)
    return procs
def shutdown_process(proc_lst, queue):
    # put one 'END' sentinel per worker, then wait for the workers to exit
    for _ in proc_lst:
        queue.put('END')
    for p in proc_lst:
        try:
            p.join()
        except KeyboardInterrupt:
            break
queueA = Queue(<size of queue> * 3)  # needs to be a bit bigger than actual. 3x works well for me
queueB = Queue(<size of queue>)
queueC = Queue(<size of queue>)
queueD = Queue(<size of queue>)

procsA = start_procs(number_of_workers, processA, (queueA, queueB))
procsB = start_procs(number_of_workers, processB, (queueB, None))

# feed some data to processA
for data in start_data:
    queueA.put(data)

# shutdown processes
shutdown_process(procsA, queueA)
shutdown_process(procsB, queueB)

# etc, etc. You could arrange the start, stop, and data feed statements to arrive at the DAG behaviour you desire.
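For example, for the A/B/C/D graph in your question, one possible arrangement could look roughly like this. It is only a sketch: stageA, stageB and stageC stand in for your real processing functions (written in the same style as processA above), and the worker counts, queue sizes and input lists are made-up placeholders.

# rough sketch of one ordering for the example DAG (A & C first, then B, then the rest)
queueA_in = Queue(300)   # raw input for A
queueC_in = Queue(300)   # raw input for C
queueAB = Queue(300)     # A's output = B's input; must hold everything A produces,
                         # since B is not started until A has finished
queueB_out = Queue(300)  # B's output, read by D, E, F and G
queueC_out = Queue(300)  # C's output, read by D

# A and C have no parents, so kick them off first (they run in parallel)
procsA = start_procs(number_of_workers, stageA, (queueA_in, queueAB))
procsC = start_procs(number_of_workers, stageC, (queueC_in, queueC_out))
for data in start_data_A:
    queueA_in.put(data)
for data in start_data_C:
    queueC_in.put(data)

# shutdown_process() joins the workers, so nothing below runs until A is done
shutdown_process(procsA, queueA_in)

# B depends only on A, so it can start now
procsB = start_procs(number_of_workers, stageB, (queueAB, queueB_out))
shutdown_process(procsB, queueAB)

# wait for C too, then start D, E, F and G from queueB_out / queueC_out in the same way
shutdown_process(procsC, queueC_in)

Once that behaves the way you want on one machine, multiprocessing's managers can expose the same queues to processes on other machines, but I would get the single-machine version working first.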