Memory is never released when opening a pandas DataFrame in a ProcessPoolExecutor

Asked: 2016-08-22 16:09:32

Tags: python pandas multiprocessing concurrent.futures

Here is a very simplified example of the code I'm using...

from concurrent.futures import ProcessPoolExecutor
import pandas


# Defined at module level so the worker processes can pickle and import it
# (nesting these under the __main__ guard breaks on platforms that spawn
# rather than fork).
def i_use_lots_of_memory():
    print('doing something that uses a lot of memory')
    data = pandas.read_csv('large_txt_file.txt')
    del data
    # do other things here as soon as I've solved mem usage issues
    print('ha ha I used up a ton of memory.')


def simplest_callback_ever(future):
    _ = future.result()  # re-raises any exception from the worker
    print('callback was run')


class ManagesFileReading(object):
    def __init__(self):
        self.pool = ProcessPoolExecutor(max_workers=24)

    def add_job(self, callback=None):
        future = self.pool.submit(i_use_lots_of_memory)
        if callback:
            future.add_done_callback(callback)


if __name__ == "__main__":
    mfr = ManagesFileReading()
    mfr.add_job(simplest_callback_ever)

In this example, I open an 800MB text file, which takes up about 2GB of memory. The output is...

doing something that uses a lot of memory
ha ha I used up a ton of memory.
callback was run
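
For anyone reproducing this, here is a minimal sketch of how the worker's memory use could be checked from inside the process, using only the standard library; report_memory is a hypothetical helper, not part of the code above (and note that ru_maxrss is reported in kilobytes on Linux but bytes on macOS):

import os
import resource

def report_memory(tag):
    # Peak resident set size of the current process, as tracked by the OS.
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('[pid %d] %s: peak RSS %.1f MB' % (os.getpid(), tag, rss_kb / 1024.0))

Calling report_memory('after read_csv') inside i_use_lots_of_memory would show the usage from within the worker process itself rather than the parent.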

So the task completes, but the problem is that the memory is never released. Even after the future is done, the memory is never freed. The only way I can free it is to shut down the process pool by running self.pool.shutdown().
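
For completeness, a minimal sketch of that shutdown workaround, assuming it is acceptable to pay the cost of tearing down and rebuilding the pool around each job; run_job_in_fresh_pool is a hypothetical helper, not part of my real code:

from concurrent.futures import ProcessPoolExecutor

def run_job_in_fresh_pool(fn, callback=None, max_workers=24):
    # One-shot pool: shutdown(wait=True) makes the worker processes exit,
    # which is what actually returns their memory to the OS.
    pool = ProcessPoolExecutor(max_workers=max_workers)
    future = pool.submit(fn)
    if callback:
        future.add_done_callback(callback)
    pool.shutdown(wait=True)
    return future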

Unless I'm misunderstanding how ProcessPoolExecutor works, once the callback runs it means the task is complete, right? So why isn't the future deleted and the memory released? Any ideas?
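
One idea I have not verified: multiprocessing.Pool accepts a maxtasksperchild argument that makes each worker exit (and therefore release its memory back to the OS) after a fixed number of tasks, with the pool spawning replacement workers automatically. A sketch under that assumption, reusing i_use_lots_of_memory from above:

import multiprocessing

if __name__ == "__main__":
    # Each worker is recycled after a single task, so its memory is returned
    # to the OS without shutting the whole pool down.
    pool = multiprocessing.Pool(processes=24, maxtasksperchild=1)
    result = pool.apply_async(i_use_lots_of_memory)
    result.wait()
    pool.close()
    pool.join()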

0 Answers:

No answers yet