Question

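I am working with a pandas DataFrame of roughly 17 million rows by 5 columns.

I then run regressions over moving windows of this DataFrame. I am trying to parallelize this computationally intensive part by passing the (static) DataFrame to multiple processes.

To simplify the example, assume I just do the work on the DataFrame 1,000 times: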

import multiprocessing as mp

def helper_func(input_tuple):
    # Runs a regression on the DataFrame and outputs the
    # results (coefficients)
    ...

if __name__ == '__main__':
    # For simplicity's sake let's assume the input tuple is
    # just a list of copies of the DataFrame
    # (df is the large DataFrame described above, loaded beforehand)
    input_tuples = [df for x in range(1000)]
    pl = mp.Pool(10)
    # Pass the list that was actually built above
    jobs = pl.map_async(helper_func, input_tuples)
    pl.close()
    result = jobs.get()

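When tracking resource usage over repeated runs, I noticed that memory keeps increasing and reaches 100% just before the process finishes. Once it finishes, memory resets to what it was before running the code.

To give actual numbers: the parent process uses about 450 MB, while each worker process uses about 1-2 GB of memory.

I worry (possibly unnecessarily) that this could lead to memory-related problems. Is there a way to reduce the memory held by the child processes? It is also not clear to me why their usage keeps growing (and is so much larger than that of the parent process, which holds the DataFrame).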
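One possible contributor (an assumption, not something measured here) is that every task argument handed to the pool is pickled and sent to a worker, so passing the DataFrame itself as the task input moves it across the process boundary many times. A minimal sketch of an alternative is below: the large object is handed to each worker once through the pool's `initializer`, and the tasks carry only small window bounds. The names `init_worker`, `_shared`, and `run` are hypothetical, the `helper_func` body is a stand-in that averages a slice instead of running a regression, and `data` can be any sliceable sequence standing in for the DataFrame.

```python
import multiprocessing as mp

# Hypothetical worker-level global; set once per worker by init_worker.
_shared = None


def init_worker(data):
    """Store the large read-only object in this worker process."""
    global _shared
    _shared = data


def helper_func(window):
    """Stand-in for the regression: average the values in one window."""
    start, stop = window
    rows = _shared[start:stop]
    return sum(rows) / len(rows)


def run(data, windows, processes=2):
    """Send `data` to each worker once; tasks carry only (start, stop)."""
    with mp.Pool(processes, initializer=init_worker, initargs=(data,)) as pool:
        return pool.map(helper_func, windows)
```

Calling `run(data, windows)` from under an `if __name__ == '__main__':` guard would return one result per window. Whether this actually lowers the per-worker footprint in the 17-million-row case would have to be checked by measurement.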

Edit: I tried other workarounds, such as setting maxtasksperchild (per High Memory Usage Using Python Multiprocessing), without success.
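For readers unfamiliar with that setting, here is a minimal sketch of what using `maxtasksperchild` looks like. It is not necessarily the exact code that was tried. The function `square_with_pid` and the pool sizes are hypothetical. `chunksize=1` is passed on the assumption that `map` batches items into chunks and that each chunk counts as one task toward the limit.

```python
import multiprocessing as mp
import os


def square_with_pid(x):
    """Hypothetical task: return the worker's PID alongside the result."""
    return os.getpid(), x * x


def run(n_tasks=8):
    # Each worker is replaced by a fresh process after completing 2 tasks,
    # so memory held by a retired worker is returned to the OS.
    with mp.Pool(processes=2, maxtasksperchild=2) as pool:
        return pool.map(square_with_pid, range(n_tasks), chunksize=1)
```

Inspecting the PIDs returned by `run()` (called under an `if __name__ == '__main__':` guard) is one way to confirm that workers are being recycled during the run.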

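Edit2: The example below shows memory usage during a run. There are small peaks and troughs (presumably memory being released?), but toward the end of the run it does hit 100%.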
