How do I make sure all processes have written to the output queue before moving on?

Asked: 2017-09-01 15:06:34

Tags: python python-multiprocessing

I am working on understanding multiprocessing in Python. At the moment, I am trying to improve my grasp of queues and processes.

What I am trying to do is iterate over some data and send chunks of it off for analysis to workers spawned earlier. As the MWE below shows, the result is sometimes computed before the workers have had time to react to their data. What is a good way to make sure my workers finish before I move on? I know about the Pool.join() method; is there something similar here? I am aware that Pool.map can do this in chunks, but it seems that if I give it an iterator over a large file (which is the end goal), it will still try to read the entire file first instead of starting work on the chunks right away.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import multiprocessing as mp
import time
import queue

def worker(inqueue, outqueue, name = None):
    if name is None:
        name = mp.current_process().pid
    print("Spawned", name)

    while True:
        # Read data from input queue
        data = inqueue.get()

        # Kill worker if input is None
        if data is None:
            print("Killing", name)
            return None

        # Compute partial sum and put on output queue
        print(name, "got data:", data)
        partial_sum = sum(data)
        outqueue.put(partial_sum)

if __name__ == '__main__':
    numbers = range(1, 101)
    buffer_size = 7  # Number of items for each partial sum

    inqueue = mp.Queue()
    outqueue = mp.Queue()

    # Define and start processes
    processes = []
    for i in range(1,5):
        p = mp.Process(target = worker,
                       args = (inqueue, outqueue, "process %d" % i,))
        p.start()
        processes.append(p)

    # Run through numbers, periodically sending buffer contents to a worker
    buffer = []
    for num in numbers:
        buffer.append(num)
        if len(buffer) >= buffer_size:
            inqueue.put(buffer)
            buffer = []
        #

    # Send remaining contents of buffer to worker
    inqueue.put(buffer)

    # Kill all processes
    for _ in range(len(processes)):
        inqueue.put(None)

    # Compute running sum as long as output queue contains stuff
    remaining = True
    running = 0

    #time.sleep(1)  # Output is as expected if we sleep for 1 sec

    while remaining:
        try:
            temp = outqueue.get(False)
            running += temp
        except queue.Empty:
            remaining = False
        #

    print(running)  # 0 if no sleep. 5050 if sleep.

3 Answers:

Answer 0 (score: 1)

Note: you are using this comment: # Kill all processes

What your code actually does is not killing: you are gracefully stopping each Process at a controlled point. Killing means interrupting at an unpredictable point of execution, which is not recommended.

Question: ... make sure my workers finish

You can do this using the Process.is_alive() method:

# Poll until every worker process has exited
while any(p.is_alive() for p in processes):
    time.sleep(0.2)

From the documentation for multiprocessing.Process:

is_alive()

    Return whether the process is alive. Roughly, a process object is alive from the moment the start() method returns until the child process terminates.
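
Alternatively, since the question's code already sends one None sentinel per worker, you can simply join each Process before draining the output queue. A minimal sketch (mine, not part of the original answer), reusing the processes, outqueue, and queue names from the question's code:

for p in processes:
    p.join()  # blocks until that worker has returned

# All partial sums have been flushed to outqueue by now, so no race.
running = 0
while True:
    try:
        running += outqueue.get_nowait()
    except queue.Empty:
        break
print(running)  # 5050, without the sleep workaround

One caveat: a process that has put a lot of data on a queue does not exit until that data has been flushed to the underlying pipe, so for large outputs you should drain the queue before or while joining; with fifteen small integers it is not an issue here.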

Answer 1 (score: 1)

You can usually use Pool.map or Pool.imap to simplify this kind of code and avoid having to manage the input and output queues yourself.

A simple example:

from multiprocessing import Pool

def work_items(n, step):
    for i in range(0, n, step):
        yield range(i, min(n, i + step))

def worker(item):
    return sum(item)

if __name__ == '__main__':
    with Pool(4) as pool:
        total = sum(pool.imap(worker, work_items(101, 7)))
        print(total)  # 5050
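
Regarding the large-file concern in the question: Pool.imap consumes its input iterable lazily and yields results as they complete, so pairing it with a generator avoids reading the whole file up front. A hypothetical sketch along the same lines (the file name bigfile.txt, the numeric-lines format, and the chunk size of 1000 are made up for illustration):

from itertools import islice
from multiprocessing import Pool

def chunks(iterable, size):
    # Yield successive lists of `size` items from any iterator.
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block

def line_sum(lines):
    # Hypothetical per-chunk work: sum one chunk of numeric lines.
    return sum(int(line) for line in lines)

if __name__ == '__main__':
    with Pool(4) as pool, open('bigfile.txt') as f:
        total = sum(pool.imap(line_sum, chunks(f, 1000)))
    print(total)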

Answer 2 (score: 0)

I don't think you should give up on trying to use a multiprocessing.Pool, because work can be sent to its processes asynchronously and incrementally, and it allows the processing to be simplified.

Here is an example based on the code in your question. Note the use of a multiprocessing.Event to tell the Pool processes to quit; you could probably use the same technique to fix the problem with your current approach.

import multiprocessing as mp
import time
import queue

def worker(args):
    event, inqueue, outqueue, name = args
    print("{} started".format(name))

    while not event.is_set():  # not stopped
        data = inqueue.get()  # Read data from input queue
        print(name, "got data:", data)
        # Compute and put partial sum on output queue
        partial_sum = sum(data)
        outqueue.put(partial_sum)

if __name__ == '__main__':

    CHUNK_SIZE = 7  # Number of items for each partial sum
    NUM_PROCESSES = 4
    numbers = range(1, 101)  # Data to process.

    mgr = mp.Manager()
    inqueue = mgr.Queue()
    outqueue = mgr.Queue()
    event = mgr.Event()

    # Create and start the processes in a processing pool
    pool = mp.Pool(processes=NUM_PROCESSES)
    args = [(event, inqueue, outqueue, "Process %d" % (i+1,))
                for i in range(NUM_PROCESSES)]
    pool.map_async(worker, args)
    pool.close()

    # Put numbers to process into the work queue in chunks
    for i in range(0, len(numbers), CHUNK_SIZE):
        chunk = list(numbers[i: i+CHUNK_SIZE])
        print('putting data:', chunk)
        inqueue.put(chunk)

    while not inqueue.empty():  # All data processed?
        time.sleep(.001)
    event.set()  # signal all data processed
    pool.terminate()  # workers may still be blocked in inqueue.get()

    # Total all the values in output queue
    final_sum = 0
    while True:
        try:
            temp = outqueue.get_nowait()
            final_sum += temp
        except queue.Empty:
            break

    print('final sum:', final_sum)  # 5050 if correct

Output from a typical test run:

putting data: [1, 2, 3, 4, 5, 6, 7]
putting data: [8, 9, 10, 11, 12, 13, 14]
putting data: [15, 16, 17, 18, 19, 20, 21]
putting data: [22, 23, 24, 25, 26, 27, 28]
putting data: [29, 30, 31, 32, 33, 34, 35]
putting data: [36, 37, 38, 39, 40, 41, 42]
putting data: [43, 44, 45, 46, 47, 48, 49]
putting data: [50, 51, 52, 53, 54, 55, 56]
putting data: [57, 58, 59, 60, 61, 62, 63]
putting data: [64, 65, 66, 67, 68, 69, 70]
putting data: [71, 72, 73, 74, 75, 76, 77]
putting data: [78, 79, 80, 81, 82, 83, 84]
putting data: [85, 86, 87, 88, 89, 90, 91]
putting data: [92, 93, 94, 95, 96, 97, 98]
putting data: [99, 100]
Process 1 started
Process 2 started
Process 1 got data: [1, 2, 3, 4, 5, 6, 7]
Process 1 got data: [8, 9, 10, 11, 12, 13, 14]
Process 1 got data: [15, 16, 17, 18, 19, 20, 21]
Process 2 got data: [22, 23, 24, 25, 26, 27, 28]
Process 2 got data: [29, 30, 31, 32, 33, 34, 35]
Process 1 got data: [36, 37, 38, 39, 40, 41, 42]
Process 3 started
Process 2 got data: [43, 44, 45, 46, 47, 48, 49]
Process 1 got data: [50, 51, 52, 53, 54, 55, 56]
Process 2 got data: [57, 58, 59, 60, 61, 62, 63]
Process 3 got data: [64, 65, 66, 67, 68, 69, 70]
Process 1 got data: [71, 72, 73, 74, 75, 76, 77]
Process 2 got data: [78, 79, 80, 81, 82, 83, 84]
Process 4 started
Process 1 got data: [85, 86, 87, 88, 89, 90, 91]
Process 2 got data: [92, 93, 94, 95, 96, 97, 98]
Process 3 got data: [99, 100]
final sum: 5050
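
A side note on a design choice in the answer above (my observation, not part of the original text): the queues come from a Manager because Pool.map_async pickles the worker arguments, and a plain multiprocessing.Queue refuses to be pickled that way, while Manager queues are picklable proxy objects. A minimal demonstration:

import multiprocessing as mp
import pickle

if __name__ == '__main__':
    mgr = mp.Manager()
    pickle.dumps(mgr.Queue())  # fine: Manager queues are picklable proxies

    try:
        pickle.dumps(mp.Queue())  # a raw queue must be inherited, not pickled
    except RuntimeError as err:
        print(err)  # "Queue objects should only be shared between processes through inheritance"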