I have some simple Python multiprocessing code along these lines:
from multiprocessing import Pool

files = ['a.txt', 'b.txt', 'c.txt']  # etc.

def convert_file(file):
    do_something(file)

mypool = Pool(number_of_workers)
mypool.map(convert_file, files)
I have 100,000s of files to convert with convert_file, and I'd like it to run so that every 20 converted files get uploaded to a server, without waiting for all of the files to be converted first. How would I go about doing that?
Answer 0 (score: 2)
With multiprocessing you have a slight problem with how to handle exceptions that occur in individual jobs. If you use the map variants, you need to be careful when polling for the results, otherwise you can lose some of them if the map function is forced to raise an exception. Further, unless you have special handling for exceptions within the job, you won't even know which job was the problem. If you use the apply variants, you don't need to be as careful when getting the results, but collating the results becomes a little trickier.

Overall, I think map is the easiest to get working.
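A minimal sketch of the problem, using a hypothetical work function that is not part of the original question: when a job given to plain pool.map raises, the exception propagates in the parent, the other results are discarded, and nothing identifies the failing input.

from multiprocessing import Pool

def work(x):
    if x == 3:
        raise ValueError('bad input')  # carries no hint of which job failed
    return x * 2

if __name__ == '__main__':
    with Pool(4) as pool:
        try:
            results = pool.map(work, range(10))
        except ValueError as ex:
            # the nine successful results are lost, and ex alone does not
            # tell us that x == 3 was the offending input
            print('a job failed:', ex)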
First, you'll need a special exception that is not created in your main module, otherwise Python won't be able to serialise and deserialise it correctly.

For example:
custom_exceptions.py
class FailedJob(Exception):
    pass
main.py
from multiprocessing import Pool
import time
import random

from custom_exceptions import FailedJob


def convert_file(filename):
    # pseudo implementation to demonstrate what might happen
    if filename == 'file2.txt':
        time.sleep(0.5)
        raise Exception
    elif filename == 'file0.txt':
        time.sleep(0.3)
    else:
        time.sleep(random.random())
    return filename  # return filename, so we can identify the job that was completed


def job(filename):
    """Wraps any exception that occurs with FailedJob so we can identify which job failed
    and why"""
    try:
        return convert_file(filename)
    except Exception as ex:
        raise FailedJob(filename) from ex


def main():
    chunksize = 4  # number of jobs before dispatch
    total_jobs = 20
    files = list('file{}.txt'.format(i) for i in range(total_jobs))

    with Pool() as pool:
        # we use imap_unordered as we don't care about order, we want the result of the
        # jobs as soon as they are done
        iter_ = pool.imap_unordered(job, files)
        while True:
            completed = []
            while len(completed) < chunksize:
                # collect results from iterator until we reach the dispatch threshold
                # or until all jobs have been completed
                try:
                    result = next(iter_)
                except StopIteration:
                    print('all child jobs completed')
                    # only break out of inner loop, might still be some completed
                    # jobs to dispatch
                    break
                except FailedJob as ex:
                    print('processing of {} job failed'.format(ex.args[0]))
                else:
                    completed.append(result)

            if completed:
                print('completed:', completed)
                # put your dispatch logic here

            if len(completed) < chunksize:
                print('all jobs completed and all job completion notifications'
                      ' dispatched to central server')
                return


if __name__ == '__main__':
    main()
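For the "put your dispatch logic here" step, one possible shape for the upload, assuming an HTTP endpoint and the requests library (the URL and payload are illustrative, not part of the original answer):

import requests  # assumption: the server accepts HTTP POSTs

def dispatch(completed, server_url='https://example.com/api/converted'):
    # hypothetical endpoint: notify the server about a batch of finished files
    response = requests.post(server_url, json={'files': completed}, timeout=30)
    response.raise_for_status()

Inside main() you would then call dispatch(completed) at the marked line.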
Answer 1 (score: 0)
You can use a shared variable across your processes to keep track of the converted files. You can find an example here.

The variable is automatically locked when a process wants to read or write it. During the lock, all other processes that want to access the variable have to wait. So you can poll the variable in the main loop and check whether it is greater than 20, while the conversion processes keep incrementing it. As soon as the value exceeds 20, you reset it and write the files to your server.
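A minimal sketch of that approach, assuming a multiprocessing.Value counter handed to the workers via the pool initializer; the sleep stands in for the real conversion, and the batch upload is just a print:

from multiprocessing import Pool, Value
import time

counter = None  # set in each worker process by init_worker

def init_worker(shared_counter):
    # runs once per worker; makes the shared counter visible there
    global counter
    counter = shared_counter

def convert_file(filename):
    time.sleep(0.1)  # stand-in for the real conversion work
    with counter.get_lock():
        counter.value += 1

def main():
    files = ['file{}.txt'.format(i) for i in range(100)]
    shared_counter = Value('i', 0)
    with Pool(4, initializer=init_worker, initargs=(shared_counter,)) as pool:
        result = pool.map_async(convert_file, files)
        while not result.ready():
            with shared_counter.get_lock():
                if shared_counter.value >= 20:
                    shared_counter.value = 0
                    print('dispatching a batch of 20 files')  # upload here
            time.sleep(0.05)
        result.get()  # re-raises any exception from the workers

if __name__ == '__main__':
    main()

Note that the counter only tells you how many files have finished, not which ones, so the workers would also need to record the filenames somewhere (for example a shared queue) for the upload step itself.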