How to track a Python multiprocessing pool and run a function after every X iterations?

Asked: 2018-12-07 03:44:40

Tags: python multiprocessing

I have some simple Python multiprocessing code that looks like this:

from multiprocessing import Pool

files = ['a.txt', 'b.txt', 'c.txt']  # etc.

def convert_file(file):
    do_something(file)

mypool = Pool(number_of_workers)
mypool.map(convert_file, files)

I have 100,000s of files to be converted by convert_file, and would like a way to upload every 20 converted files to a server without waiting for all the files to be converted first. How would I go about doing that?

2 answers:

Answer 0 (score: 2):

With multiprocessing you have a slight problem with how to handle exceptions that occur in individual jobs. If you use the map variants then you need to be careful about how you poll for results, otherwise you may lose some of them if the map function is forced to raise an exception. Furthermore, unless you do something special with exceptions inside the job, you won't even know which job was the problem. If you use the apply variants then you don't need to be as careful when getting your results, but collating the results becomes a little trickier.
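As a minimal sketch of that difference (using a hypothetical do_work function that is not part of the question's code): with map, one failing job makes the whole call raise and the results of the jobs that succeeded are lost, whereas apply_async gives you one AsyncResult per job, so successes survive but you have to collate them yourself.

from multiprocessing import Pool

def do_work(x):
    # hypothetical job: fails for one particular input
    if x == 3:
        raise ValueError(x)
    return x * 2

if __name__ == '__main__':
    with Pool() as pool:
        # map variant: the exception propagates out of map() and the
        # results of the jobs that did succeed are lost
        try:
            results = pool.map(do_work, range(6))
        except ValueError as ex:
            print('map lost all results because job {} failed'.format(ex.args[0]))

        # apply variant: one AsyncResult per job, so successes survive,
        # but collecting and matching them up is your responsibility
        handles = [(x, pool.apply_async(do_work, (x,))) for x in range(6)]
        for x, handle in handles:
            try:
                print(x, '->', handle.get())
            except ValueError:
                print(x, '-> failed')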

Overall, I think map is the easiest to get working.

To start with, you need a special exception that is not defined in your main module, otherwise Python will not be able to serialise and deserialise it correctly.

For example:

custom_exceptions.py

class FailedJob(Exception):
    pass

main.py

from multiprocessing import Pool
import time
import random

from custom_exceptions import FailedJob


def convert_file(filename):
    # pseudo implementation to demonstrate what might happen
    if filename == 'file2.txt':
        time.sleep(0.5)
        raise Exception
    elif filename == 'file0.txt':
        time.sleep(0.3)
    else:
        time.sleep(random.random())
    return filename  # return filename, so we can identify the job that was completed


def job(filename):
    """Wraps any exception that occurs with FailedJob so we can identify which job failed 
    and why""" 
    try:
        return convert_file(filename)
    except Exception as ex:
        raise FailedJob(filename) from ex


def main():
    chunksize = 4  # number of jobs before dispatch
    total_jobs = 20
    files = list('file{}.txt'.format(i) for i in range(total_jobs))

    with Pool() as pool:
        # we use imap_unordered as we don't care about order, we want the result of the 
        # jobs as soon as they are done
        iter_ = pool.imap_unordered(job, files)
        while True:
            completed = []
            while len(completed) < chunksize:
                # collect results from iterator until we reach the dispatch threshold
                # or until all jobs have been completed
                try:
                    result = next(iter_)
                except StopIteration:
                    print('all child jobs completed')
                    # only break out of inner loop, might still be some completed
                    # jobs to dispatch
                    break
                except FailedJob as ex:
                    print('processing of {} job failed'.format(ex.args[0]))
                else:
                    completed.append(result)

            if completed:
                print('completed:', completed)
                # put your dispatch logic here

            if len(completed) < chunksize:
                print('all jobs completed and all job completion notifications'
                   ' dispatched to central server')
                return


if __name__ == '__main__':
    main()

Answer 1 (score: 0):

You can use a variable shared across your processes to keep track of the converted files. You can find an example here.

The variable is automatically locked when a process wants to read or write it. During the lock, all other processes that want to access the variable have to wait. So you can poll the variable in your main loop and check whether it has reached 20, while the conversion processes keep incrementing it. As soon as the value exceeds 20, you reset it and write the files to your server.
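A minimal sketch of this idea, assuming a stand-in conversion and a placeholder upload_batch function (both hypothetical, not from the answer): a multiprocessing.Value holds the counter, each worker increments it under the Value's built-in lock, and the main process polls it and dispatches an upload every 20 completions.

import time
from multiprocessing import Pool, Value

counter = None  # each worker's reference to the shared counter

def init_worker(shared_counter):
    # runs once per worker process; stores the shared Value in a global
    global counter
    counter = shared_counter

def convert_file(filename):
    time.sleep(0.05)  # stand-in for the real conversion work
    with counter.get_lock():  # Value carries its own lock
        counter.value += 1

def upload_batch():
    print('uploading a batch of converted files to the server')

if __name__ == '__main__':
    files = ['file{}.txt'.format(i) for i in range(100)]
    shared = Value('i', 0)
    uploaded = 0
    with Pool(initializer=init_worker, initargs=(shared,)) as pool:
        result = pool.map_async(convert_file, files)
        while not result.ready():
            done = shared.value  # a plain read is fine for polling
            if done - uploaded >= 20:
                upload_batch()
                uploaded = done
            result.wait(0.2)  # poll a few times per second
    if shared.value > uploaded:
        upload_batch()  # dispatch whatever is left over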