Question: I have started running some massive data-processing jobs on a machine in a DAG (directed acyclic graph) structure. Some processes can only start once their parent's data processing is complete, since there are multiple levels of processing. As a first goal I want to handle all of it on a single machine with the Python multiprocessing library, and then scale out to different machines using Managers. I have no prior experience with Python multiprocessing. Can anyone suggest whether it is a good library to start with? If so, some basic implementation ideas would be great. If not, what else could be used to do this in Python?
Example:
A -> B
B -> D, E, F, G
C -> D
In the example above, I want to kick off A & C first (in parallel). After they execute successfully, the other remaining processes will wait for B to finish first. Once B has finished executing, all the other processes start.
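To make the dependencies concrete, this is roughly how I picture them in Python (the letters are just the placeholders from the example above):

# each key is a step, each value is the set of parents it has to wait for
dependencies = {
    'A': set(),
    'C': set(),
    'B': {'A'},
    'D': {'B', 'C'},
    'E': {'B'},
    'F': {'B'},
    'G': {'B'},
}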
P.S.: Sorry, I can't share the actual data due to confidentiality, though I have tried to make it clear with this example.
Answer 0 (score: 1)
I really like using processes and queues for this sort of thing.
Like this:
from multiprocessing import Process, Queue
from queue import Empty as QueueEmpty  # on Python 2: from Queue import Empty as QueueEmpty
import time

# example process functions
def processA(queueA, queueB):
    # read from queueA, process, and pass the result on to queueB
    while True:
        try:
            data = queueA.get_nowait()
            if data == 'END':
                break
        except QueueEmpty:
            time.sleep(2)  # wait some time for data to enter queue
            continue
        # do stuff with data
        queueB.put(data)

def processB(queueB, _):
    # read from queueB; last stage, so nothing is passed downstream
    while True:
        try:
            data = queueB.get_nowait()
            if data == 'END':
                break
        except QueueEmpty:
            time.sleep(2)  # wait some time for data to enter queue
            continue
        # do stuff with data
# helper functions for starting and stopping processes
def start_procs(num_workers, target_function, args):
    procs = []
    for _ in range(num_workers):
        p = Process(target=target_function, args=args)
        p.start()
        procs.append(p)
    return procs
def shutdown_process(proc_lst, queue):
    # put one 'END' sentinel per worker, then wait for the workers to exit
    for _ in proc_lst:
        queue.put('END')
    for p in proc_lst:
        try:
            p.join()
        except KeyboardInterrupt:
            break
queueA = Queue(<size of queue> * 3)  # needs to be a bit bigger than actual. 3x works well for me
queueB = Queue(<size of queue>)
queueC = Queue(<size of queue>)
queueD = Queue(<size of queue>)

procsA = start_procs(number_of_workers, processA, (queueA, queueB))
procsB = start_procs(number_of_workers, processB, (queueB, None))

# feed some data to processA
for data in start_data:
    queueA.put(data)

# shutdown processes
shutdown_process(procsA, queueA)
shutdown_process(procsB, queueB)

# etc, etc. You could arrange the start, stop, and data feed statements to arrive at the DAG behaviour you desire.
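For example, for the A/B/C/D graph in your question, one possible arrangement could look roughly like this. It is only a sketch: stageA, stageB and stageC stand in for your real processing functions (written in the same style as processA above), and the worker counts, queue sizes and input lists are made-up placeholders.

# rough sketch of one ordering for the example DAG (A & C first, then B, then the rest)
queueA_in = Queue(300)   # raw input for A
queueC_in = Queue(300)   # raw input for C
queueAB = Queue(300)     # A's output = B's input; must hold everything A produces,
                         # since B is not started until A has finished
queueB_out = Queue(300)  # B's output, read by D, E, F and G
queueC_out = Queue(300)  # C's output, read by D

# A and C have no parents, so kick them off first (they run in parallel)
procsA = start_procs(number_of_workers, stageA, (queueA_in, queueAB))
procsC = start_procs(number_of_workers, stageC, (queueC_in, queueC_out))
for data in start_data_A:
    queueA_in.put(data)
for data in start_data_C:
    queueC_in.put(data)

# shutdown_process() joins the workers, so nothing below runs until A is done
shutdown_process(procsA, queueA_in)

# B depends only on A, so it can start now
procsB = start_procs(number_of_workers, stageB, (queueAB, queueB_out))
shutdown_process(procsB, queueAB)

# wait for C too, then start D, E, F and G from queueB_out / queueC_out in the same way
shutdown_process(procsC, queueC_in)

Once that behaves the way you want on one machine, multiprocessing's managers can expose the same queues to processes on other machines, but I would get the single-machine version working first.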