I am trying to run parallel processing for my requirements, and the code seems to handle 4k-5k elements in parallel without any problem. However, once the number of elements to process grows, the code processes some of the items and then the program suddenly stops running without raising any error.
I have checked that the program is not hanging, RAM is available (I have 16 GB of RAM), and CPU utilization is not even 30%. I cannot figure out what is happening. I have 1 million elements to process.
import multiprocessing

def get_items_to_download():
    # iterator to fetch all items that are to be downloaded
    yield download_item

def start_download_process():
    multiproc_pool = multiprocessing.Pool(processes=10)
    for download_item in get_items_to_download():
        multiproc_pool.apply_async(start_processing, args=(download_item,), callback=results_callback)
    multiproc_pool.close()
    multiproc_pool.join()

def start_processing(download_item):
    try:
        # Code to download item from web API
        # Code to perform some processing on the data
        # Code to update data into database
        return True
    except Exception as e:
        return False

def results_callback(result):
    print(result)

if __name__ == "__main__":
    start_download_process()
Update -
Error found - BrokenPipeError: [Errno 32] Broken pipe
Traceback -
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 125, in worker
    put((job, i, result))
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 347, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Answer 0 (score: 0)
The code looks correct. The only thing I can think of is that all of your processes are hanging, waiting to complete. Here is a suggestion: instead of using the callback mechanism provided by apply_async, use the AsyncResult object it returns to retrieve the return value from each process. You can call get on this object and specify a timeout value (30 seconds is chosen arbitrarily below, which may not be long enough). If the task does not complete within that duration, a timeout exception is raised (you can catch it if you wish; see the sketch after the code below). This will test the hypothesis that the processes are stalling. Just make sure the timeout you specify is large enough for the tasks to complete within that period. I have also broken the task submission into batches of 1000, not because I think a size of 1,000,000 is necessarily a problem in itself, but so that you do not end up with a list of 1,000,000 result objects. If you find that it no longer hangs, try increasing the batch size and see whether that actually makes a difference.
import multiprocessing

def get_items_to_download():
    # iterator to fetch all items that are to be downloaded
    yield download_item

BATCH_SIZE = 1000

def start_download_process():
    with multiprocessing.Pool(processes=10) as multiproc_pool:
        results = []
        for download_item in get_items_to_download():
            results.append(multiproc_pool.apply_async(start_processing, args=(download_item,)))
            if len(results) == BATCH_SIZE:
                process_results(results)
                results = []
        if len(results):
            process_results(results)

def start_processing(download_item):
    try:
        # Code to download item from web API
        # Code to perform some processing on the data
        # Code to update data into database
        return True
    except Exception as e:
        return False

TIMEOUT_VALUE = 30  # or some suitable value

def process_results(results):
    for result in results:
        return_value = result.get(TIMEOUT_VALUE)  # will cause an exception if process is hanging
        print(return_value)

if __name__ == "__main__":
    start_download_process()
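If you prefer to catch the timeout rather than let it propagate, a minimal sketch of a process_results variant is shown below. It assumes only the same TIMEOUT_VALUE constant as above and is an illustration, not part of the original suggestion:

import multiprocessing

TIMEOUT_VALUE = 30  # same arbitrary timeout as above

def process_results(results):
    for result in results:
        try:
            # AsyncResult.get raises multiprocessing.TimeoutError if the worker
            # has not produced a result within TIMEOUT_VALUE seconds
            return_value = result.get(TIMEOUT_VALUE)
        except multiprocessing.TimeoutError:
            print("task did not finish within", TIMEOUT_VALUE, "seconds")
        else:
            print(return_value)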
Update
Based on googling a few pages about the broken pipe error, it appears your error may be the result of running out of memory. See, for example, Python Multiprocessing: Broken Pipe exception after increasing Pool size. The rework below attempts to use less memory. If it works, you can then try increasing the batch size:
import multiprocessing

BATCH_SIZE = 1000
POOL_SIZE = 10

def get_items_to_download():
    # iterator to fetch all items that are to be downloaded
    yield download_item

def start_download_process():
    with multiprocessing.Pool(processes=POOL_SIZE) as multiproc_pool:
        items = []
        for download_item in get_items_to_download():
            items.append(download_item)
            if len(items) == BATCH_SIZE:
                process_items(multiproc_pool, items)
                items = []
        if len(items):
            process_items(multiproc_pool, items)

def start_processing(download_item):
    try:
        # Code to download item from web API
        # Code to perform some processing on the data
        # Code to update data into database
        return True
    except Exception as e:
        return False

def compute_chunksize(iterable_size):
    if iterable_size == 0:
        return 0
    chunksize, extra = divmod(iterable_size, POOL_SIZE * 4)
    if extra:
        chunksize += 1
    return chunksize

def process_items(multiproc_pool, items):
    chunksize = compute_chunksize(len(items))
    # you must iterate the iterable returned:
    for return_value in multiproc_pool.imap(start_processing, items, chunksize):
        print(return_value)

if __name__ == "__main__":
    start_download_process()
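A side note on the design: Pool.imap returns results in submission order. If the order in which results are printed does not matter, the standard Pool.imap_unordered method can be swapped in so that each result is yielded as soon as its worker finishes. A small sketch of that variation (not part of the original answer):

def process_items(multiproc_pool, items):
    chunksize = compute_chunksize(len(items))
    # imap_unordered yields results in completion order rather than submission order
    for return_value in multiproc_pool.imap_unordered(start_processing, items, chunksize):
        print(return_value)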
Answer 1 (score: 0)
import multiprocessing

def get_items_to_download():
    # Instead of yield, return the complete generator object to avoid iterating over this function.
    # Return type - generator (download_item1, download_item2, ...)
    return download_item

def start_download_process():
    download_item = get_items_to_download()
    # Specify the chunksize to get faster results.
    with multiprocessing.Pool(processes=10) as pool:
        # map_async() is also available, if that's your use case.
        results = pool.map(start_processing, download_item, chunksize=XX)
        print(results)
        return results

def start_processing(download_item):
    try:
        # Code to download item from web API
        # Code to perform some processing on the data
        # Code to update data into database
        return True
    except Exception as e:
        return False

def results_callback(result):
    print(result)

if __name__ == "__main__":
    start_download_process()
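As a concrete illustration of this pattern, here is a minimal, self-contained sketch. The URL items, the item count, and the chunksize of 100 are hypothetical placeholders, not values taken from the question:

import multiprocessing

def get_items_to_download():
    # return a generator of download items (here, hypothetical URL strings)
    return (f"https://example.com/item/{i}" for i in range(10000))

def start_processing(download_item):
    try:
        # download, process and store the item here
        return True
    except Exception:
        return False

if __name__ == "__main__":
    with multiprocessing.Pool(processes=10) as pool:
        # map() consumes the generator and hands items to workers in chunks of 100
        results = pool.map(start_processing, get_items_to_download(), chunksize=100)
    print(sum(results), "items processed successfully")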
Answer 2 (score: 0)
I had the same experience using Python 3.8 on Linux. I set up a new environment with Python 3.7, and multiprocessing.Pool() now works correctly.