I am trying to run parallel processing for my requirements, and the code seems to handle 4k-5k elements in parallel without any problem. However, once the number of elements to process grows, the code processes some of the items and then the program suddenly stops running without raising any error.
I have checked that the program is not hanging, RAM is available (I have 16 GB of RAM), and CPU utilization is not even 30%. I cannot figure out what is happening. I have 1 million elements to process.
import multiprocessing

def get_items_to_download():
    # iterator to fetch all items that are to be downloaded
    yield download_item

def start_download_process():
    multiproc_pool = multiprocessing.Pool(processes=10)
    for download_item in get_items_to_download():
        multiproc_pool.apply_async(start_processing, args=(download_item,), callback=results_callback)
    multiproc_pool.close()
    multiproc_pool.join()

def start_processing(download_item):
    try:
        # Code to download item from web API
        # Code to perform some processing on the data
        # Code to update data into database
        return True
    except Exception as e:
        return False

def results_callback(result):
    print(result)

if __name__ == "__main__":
    start_download_process()
Update -
Error found - BrokenPipeError: [Errno 32] Broken pipe
Traceback -
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 125, in worker
    put((job, i, result))
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 347, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Answer 0 (score: 0)
The code looks correct. The only thing I can think of is that all of your processes are hanging, waiting to complete. Here is a suggestion: instead of using the callback mechanism provided by apply_async, use the AsyncResult object it returns to retrieve the return value from each process. You can call get on this object and specify a timeout value (30 seconds is chosen arbitrarily below, which may not be long enough). If the task does not complete within that duration, a timeout exception is raised (you can catch it if you wish; see the sketch after the code below). This will test the hypothesis that the processes are stalling. Just make sure the timeout you specify is large enough for the tasks to complete within that period. I have also broken the task submission into batches of 1000, not because I think a size of 1,000,000 is necessarily a problem in itself, but so that you do not end up with a list of 1,000,000 result objects. If you find that it no longer hangs, try increasing the batch size and see whether that actually makes a difference.
import multiprocessing

def get_items_to_download():
    # iterator to fetch all items that are to be downloaded
    yield download_item

BATCH_SIZE = 1000

def start_download_process():
    with multiprocessing.Pool(processes=10) as multiproc_pool:
        results = []
        for download_item in get_items_to_download():
            results.append(multiproc_pool.apply_async(start_processing, args=(download_item,)))
            if len(results) == BATCH_SIZE:
                process_results(results)
                results = []
        if len(results):
            process_results(results)

def start_processing(download_item):
    try:
        # Code to download item from web API
        # Code to perform some processing on the data
        # Code to update data into database
        return True
    except Exception as e:
        return False

TIMEOUT_VALUE = 30  # or some suitable value

def process_results(results):
    for result in results:
        return_value = result.get(TIMEOUT_VALUE)  # will cause an exception if process is hanging
        print(return_value)

if __name__ == "__main__":
    start_download_process()
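If you prefer to catch the timeout rather than let it propagate, a minimal sketch of a process_results variant is shown below. It assumes only the same TIMEOUT_VALUE constant as above and is an illustration, not part of the original suggestion:

import multiprocessing

TIMEOUT_VALUE = 30  # same arbitrary timeout as above

def process_results(results):
    for result in results:
        try:
            # AsyncResult.get raises multiprocessing.TimeoutError if the worker
            # has not produced a result within TIMEOUT_VALUE seconds
            return_value = result.get(TIMEOUT_VALUE)
        except multiprocessing.TimeoutError:
            print("task did not finish within", TIMEOUT_VALUE, "seconds")
        else:
            print(return_value)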
Update
Based on googling a few pages about the broken pipe error, it appears your error may be the result of running out of memory. See, for example, Python Multiprocessing: Broken Pipe exception after increasing Pool size. The rework below attempts to use less memory. If it works, you can then try increasing the batch size:
import multiprocessing

BATCH_SIZE = 1000
POOL_SIZE = 10

def get_items_to_download():
    # iterator to fetch all items that are to be downloaded
    yield download_item

def start_download_process():
    with multiprocessing.Pool(processes=POOL_SIZE) as multiproc_pool:
        items = []
        for download_item in get_items_to_download():
            items.append(download_item)
            if len(items) == BATCH_SIZE:
                process_items(multiproc_pool, items)
                items = []
        if len(items):
            process_items(multiproc_pool, items)

def start_processing(download_item):
    try:
        # Code to download item from web API
        # Code to perform some processing on the data
        # Code to update data into database
        return True
    except Exception as e:
        return False

def compute_chunksize(iterable_size):
    if iterable_size == 0:
        return 0
    chunksize, extra = divmod(iterable_size, POOL_SIZE * 4)
    if extra:
        chunksize += 1
    return chunksize

def process_items(multiproc_pool, items):
    chunksize = compute_chunksize(len(items))
    # you must iterate the iterable returned:
    for return_value in multiproc_pool.imap(start_processing, items, chunksize):
        print(return_value)

if __name__ == "__main__":
    start_download_process()
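A side note on the design: Pool.imap returns results in submission order. If the order in which results are printed does not matter, the standard Pool.imap_unordered method can be swapped in so that each result is yielded as soon as its worker finishes. A small sketch of that variation (not part of the original answer):

def process_items(multiproc_pool, items):
    chunksize = compute_chunksize(len(items))
    # imap_unordered yields results in completion order rather than submission order
    for return_value in multiproc_pool.imap_unordered(start_processing, items, chunksize):
        print(return_value)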
Answer 1 (score: 0)
import multiprocessing

def get_items_to_download():
    # Instead of yield, return the complete generator object to avoid iterating over this function.
    # Return type - generator (download_item1, download_item2, ...)
    return download_item

def start_download_process():
    download_item = get_items_to_download()
    # Specify the chunksize to get faster results.
    with multiprocessing.Pool(processes=10) as pool:
        # map_async() is also available, if that's your use case.
        results = pool.map(start_processing, download_item, chunksize=XX)
        print(results)
        return results

def start_processing(download_item):
    try:
        # Code to download item from web API
        # Code to perform some processing on the data
        # Code to update data into database
        return True
    except Exception as e:
        return False

def results_callback(result):
    print(result)

if __name__ == "__main__":
    start_download_process()
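As a concrete illustration of this pattern, here is a minimal, self-contained sketch. The URL items, the item count, and the chunksize of 100 are hypothetical placeholders, not values taken from the question:

import multiprocessing

def get_items_to_download():
    # return a generator of download items (here, hypothetical URL strings)
    return (f"https://example.com/item/{i}" for i in range(10000))

def start_processing(download_item):
    try:
        # download, process and store the item here
        return True
    except Exception:
        return False

if __name__ == "__main__":
    with multiprocessing.Pool(processes=10) as pool:
        # map() consumes the generator and hands items to workers in chunks of 100
        results = pool.map(start_processing, get_items_to_download(), chunksize=100)
    print(sum(results), "items processed successfully")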
Answer 2 (score: 0)
I had the same experience using Python 3.8 on Linux. I set up a new environment with Python 3.7, and multiprocessing.Pool() now works correctly.