Using concurrent.futures without running out of RAM

Date: 2016-01-13 15:10:57

Tags: python python-3.x memory-management parallel-processing

I'm doing some file parsing, which is a CPU-bound task. No matter how many files I throw at the process, it uses no more than about 50 MB of RAM. The task is parallelisable, and I've set it up to use concurrent.futures (below) to parse each file as a separate process:

    from concurrent import futures
    with futures.ProcessPoolExecutor(max_workers=6) as executor:
        # A dictionary mapping each future (key) to the filename it was submitted for (value)
        jobs = {}

        # Loop through the files, and run the parse function for each file, sending the file-name to it.
        # The results can come back in any order.
        for this_file in files_list:
            job = executor.submit(parse_function, this_file, **parser_variables)
            jobs[job] = this_file

        # Get the completed jobs whenever they are done
        for job in futures.as_completed(jobs):

            # Get the result (job.result()) and the filename the job was based on (jobs[job])
            results_list = job.result()
            this_file = jobs[job]

            # Delete the job from the dict as we don't need to keep it.
            del jobs[job]

            # post-processing (putting the results into a database)
            post_process(this_file, results_list)

The problem is that when I run this with futures, RAM usage rockets and before long I've run out and Python has crashed. This is probably in large part because the results from parse_function are several MB in size. Once the results have been through post_process, the application no longer needs them. As you can see, I'm trying del jobs[job] to clear items out of jobs, but this has made no difference; memory usage remains unchanged and seems to increase at the same rate.

I've also confirmed that it's not because it's waiting on the post_process function, by only using a single process and by throwing in a time.sleep(1).
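To actually watch the growth, one option (not in my code above; it assumes the third-party psutil package is installed) is to log the parent process's resident memory each time a job completes:

    import os
    import psutil  # third-party: pip install psutil

    this_process = psutil.Process(os.getpid())

    for job in futures.as_completed(jobs):
        results_list = job.result()
        this_file = jobs[job]
        del jobs[job]
        post_process(this_file, results_list)
        # Log the resident set size (RSS) after each completed job
        print("{}: RSS = {:.1f} MB".format(
            this_file, this_process.memory_info().rss / (1024 * 1024)))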

There's nothing in the futures documentation about memory management, and while a brief search indicates it has come up before in real-world applications of futures (Clear memory in python loop and http://grokbase.com/t/python/python-list/1458ss5etz/real-world-use-of-concurrent-futures), the answers don't translate to my use case (they're all concerned with timeouts and the like).

So, how do you use concurrent futures without running out of RAM? (Python 3.5)

2 Answers:

Answer 0 (score: 7)

I'll take a shot (might be a wrong guess...).

You might want to submit your work bit by bit, since on every submit you make a copy of parser_variables, which may end up chewing up your RAM.
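To get a rough idea of how much data each submit ships to a worker process, a sketch of mine (assuming parser_variables is picklable, which it has to be for ProcessPoolExecutor anyway) is to measure the pickled payload:

    import pickle

    # Each submitted task's arguments are pickled and sent to a worker process,
    # so roughly this many bytes are duplicated per pending job.
    payload = pickle.dumps((this_file, parser_variables),
                           protocol=pickle.HIGHEST_PROTOCOL)
    print("~{:.1f} KB per submit".format(len(payload) / 1024.0))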

Here's the interesting part: an updated version of your code, with the changes marked "<----".

    MAX_JOBS_IN_QUEUE = 15  # cap on pending futures; pick a value that fits your memory budget

    with futures.ProcessPoolExecutor(max_workers=6) as executor:
        # A dictionary mapping each future (key) to the filename it was submitted for (value)
        jobs = {}

        # Loop through the files, and run the parse function for each file, sending the file-name to it.
        # The results can come back in any order.
        files_left = len(files_list)  # <----
        files_iter = iter(files_list)  # <------

        while files_left:
            for this_file in files_iter:
                job = executor.submit(parse_function, this_file, **parser_variables)
                jobs[job] = this_file
                if len(jobs) > MAX_JOBS_IN_QUEUE:
                    break  # limit the job submission for now

            # Get the completed jobs whenever they are done
            for job in futures.as_completed(jobs):

                files_left -= 1  # one down - many to go...   <---

                # Get the result (job.result()) and the filename the job was based on (jobs[job])
                results_list = job.result()
                this_file = jobs[job]

                # Delete the job from the dict as we don't need to keep it.
                del jobs[job]

                # post-processing (putting the results into a database)
                post_process(this_file, results_list)
                break  # give a chance to add more jobs <-----
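If the nested break pattern feels awkward, another way to keep only a bounded number of jobs in flight is futures.wait with FIRST_COMPLETED. This is a sketch of mine along the same lines (not from the answer above), reusing the question's parse_function, parser_variables and post_process:

    from concurrent import futures

    MAX_PENDING = 10  # illustrative cap; tune to your memory budget

    with futures.ProcessPoolExecutor(max_workers=6) as executor:
        pending = {}  # future -> filename
        files_iter = iter(files_list)

        def top_up():
            # Submit jobs until the pending set hits the cap or the files run out.
            for this_file in files_iter:
                fut = executor.submit(parse_function, this_file, **parser_variables)
                pending[fut] = this_file
                if len(pending) >= MAX_PENDING:
                    break

        top_up()
        while pending:
            # Block until at least one job finishes, then drain all finished ones.
            done, _ = futures.wait(pending, return_when=futures.FIRST_COMPLETED)
            for fut in done:
                this_file = pending.pop(fut)
                post_process(this_file, fut.result())
            top_up()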

Answer 1 (score: 0)

You can try adding del to your code, like this:

    for job in futures.as_completed(jobs):
        del jobs[job]
        del job  # or job._result = None
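Why this can help (my reading, not spelled out above): a completed Future keeps a reference to its result, so as long as the jobs dict or the job name still points at the future, the multi-megabyte result cannot be garbage-collected. Folding that into the original loop might look like this sketch:

    for job in futures.as_completed(jobs):
        this_file = jobs.pop(job)   # drop the dict's reference to the future
        results_list = job.result()
        post_process(this_file, results_list)
        del results_list            # drop our reference to the multi-MB result
        del job                     # drop the future (and the result it still holds)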