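Using Python multiprocessing

Date: 2017-06-28 14:54:00

Tags: python process multiprocessing python-multiprocessing

Question: I have started building a DAG (directed acyclic graph) structure to run some large-scale data processing on a machine. Because the processing has multiple levels, some processes can only start once their parent's data processing has finished. As a first goal I want to use the Python multiprocessing library to handle all of this on a single machine, and later scale out to different machines using Managers. I have no prior experience with Python multiprocessing. Can anyone tell me whether it is a good library to start with? If yes, some basic implementation ideas would be enough. If not, what else could I use to do this in Python?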

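Example:

A -> B

B -> D, E, F, G

C -> D

In the example above, I want to kick off A and C first (in parallel). After they have executed successfully, the remaining processes wait for B to finish. Once B has finished executing, all the other processes start.

P.S.: Sorry, I cannot share the actual data for confidentiality reasons, but I have tried to make things clear with this example.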

1 answer:

Answer 0: (score: 1)

I really like using processes and queues for this kind of thing.

Something like this:

from multiprocessing import Process, Queue
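try:
    from queue import Empty as QueueEmpty  # Python 3 module name
except ImportError:
    from Queue import Empty as QueueEmpty  # Python 2 module name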
import time

#example process functions
def processA(queueA, queueB):
    while True:
        try:
            data = queueA.get_nowait()
            if data == 'END':
                break
        except QueueEmpty:
            time.sleep(2) #wait some time for data to enter queue
            continue
        #do stuff with data
        queueB.put(data)

def processB(queueB, _):
    while True:
        try:
            data = queueB.get_nowait()
            if data == 'END':
                break
        except QueueEmpty:
            time.sleep(2) #wait some time for data to enter queue
            continue
        #do stuff with data

#helper functions for starting and stopping processes
def start_procs(num_workers, target_function, args):
    procs = []
    for _ in range(num_workers):
        p = Process(target=target_function, args=args)
        p.start()
        procs.append(p)
    return procs

def shutdown_process(proc_lst, queue):
    for _ in proc_lst:
        queue.put('END')
    for p in proc_lst:
        try:
            p.join()
        except KeyboardInterrupt:
            break

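# Note: on platforms that spawn worker processes (e.g. Windows), the statements
# from here down belong under `if __name__ == '__main__':`.
queueA = Queue(<size of queue> * 3) #needs to be a bit bigger than actual. 3x works well for me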
queueB = Queue(<size of queue>)
queueC = Queue(<size of queue>)
queueD = Queue(<size of queue>)

procsA = start_procs(number_of_workers, processA, (queueA, queueB)) 
procsB = start_procs(number_of_workers, processB, (queueB, None))  

# feed some data to processA
for data in start_data:
    queueA.put(data)

#shutdown processes
shutdown_process(procsA, queueA)
shutdown_process(procsB, queueB)

#etc, etc. You could arrange the start, stop, and data feed statements to arrive at the DAG behaviour you desire
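To make the "arrange the start and stop statements" idea a bit more concrete, here is a minimal sketch of one way to drive the question's example DAG with a `multiprocessing.Pool`. It is not the code above rewritten: `run_task`, `run_dag` and `DEPS` are hypothetical names made up for illustration, and the scheduling is deliberately simple. Tasks run in "waves", so each wave waits for the whole previous wave rather than only for its own parents.

```python
from multiprocessing import Pool


def run_task(name):
    """Hypothetical stand-in for one processing stage; just returns its name."""
    return name


def run_dag(deps, task_fn, pool):
    """Run each task in `deps` once all of its parents have finished.

    deps: dict mapping task name -> set of parent task names.
    pool: any object with a Pool-style apply_async (e.g. multiprocessing.Pool).
    Returns the list of waves; tasks within one wave were submitted together.
    """
    remaining = {task: set(parents) for task, parents in deps.items()}
    done = set()
    waves = []
    while remaining:
        # A task is ready when every one of its parents is already done.
        ready = sorted(t for t, parents in remaining.items() if parents <= done)
        if not ready:
            raise ValueError("cycle or unknown parent among: %s" % sorted(remaining))
        results = [pool.apply_async(task_fn, (t,)) for t in ready]
        for r in results:
            r.get()  # blocks until finished; re-raises a worker's exception
        done.update(ready)
        for t in ready:
            del remaining[t]
        waves.append(ready)
    return waves


# Dependencies from the question: A -> B, B -> D,E,F,G and C -> D.
DEPS = {
    "A": set(), "C": set(),
    "B": {"A"},
    "D": {"B", "C"},
    "E": {"B"}, "F": {"B"}, "G": {"B"},
}

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print(run_dag(DEPS, run_task, pool))
```

Because `r.get()` re-raises exceptions from the worker, a failed parent stops the run before its children are started, which matches the "after successful execution" requirement in the question. For finer-grained scheduling (starting a child as soon as its own parents finish), the queue-based layout above is the more flexible starting point.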