Question

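I am launching a Process that pulls several batches of data from a database into a DataFrame with a date index. From there I create a Manager to store that data and use a Pool to call a function so that all CPU cores are used. Because the data is large, shared memory is needed for the pooled method to work.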

import multiprocessing as mp
import pandas as pd
from time import sleep

# Launched from __main__
class MyProcess(mp.Process):
    def calculate(self):
        # Get and preprocess data
        allMyData = <fetch from Mongo>
        aggs = <resample data into multiple aggregations>

        # Copy each aggregation into a Manager dict so the pool workers can reach it
        manager = mp.Manager()
        aggregateData = manager.dict()
        for key, value in aggs.items():
            aggregateData[key] = value

        # Set up the pool and methods
        NUM_PROCESSES = 16
        pool = mp.Pool(processes=NUM_PROCESSES)
        procs = []
        for thing in thousandsOfThingsIWantToCalculate:
            # Run the method asynchronously; args must be a tuple, hence the trailing comma
            proc = pool.apply_async(method, args=(aggregateData,))
            procs.append({proc: []})

            # Wait so that the pool tasks are not all created in memory at once;
            # without this there was a different memory problem
            while len(pool._cache) > NUM_PROCESSES:
                sleep(0.01)

        # Collect the result of every task
        for entry in procs:
            p, res = next(iter(entry.items()))
            res = p.get()
        pool.close()
        pool.join()
        # Do stuff with results

def method(data):
    ...
    # Loop through all the data by row
    for row in data.itertuples():
        # Can be empty
        pass
    ...

Eventually I get the error below and the process running the pool exits. This happens before the "for thing in ..." loop has finished.

OSError: [Errno 12] Cannot allocate memory

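I have narrowed everything down to the following line in the method. I believe it creates a copy of the memory in the row variable, and that copy is not released effectively. If I comment the line out, other memory is still consumed, but it is released by the pooled method. If I run with the line in place, even with an empty loop body and nothing done afterwards, the memory is consumed entirely.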

for row in data.itertuples():

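Is this line actually making a copy of the DataFrame's data, and is that the source of the extra memory consumption? Update: yes, and anything whose reference count is incremented (i.e. any local variable) gets copied as well.
I have noticed the same thing elsewhere: creating local variables triggers the memory problem, whereas fetching the data directly from the shared memory structure at the point of use does not. How can I iterate over the DataFrame without creating a local variable? I cannot really vectorize the itertuples loop, because it is inherently sequential (each iteration depends on the result of the previous one).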
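For the sequential case described above, one pattern that may be worth trying is to loop over a column's underlying NumPy array by index instead of building a row tuple on every iteration. This is a minimal sketch, not a confirmed fix: `sequential_ema`, the `price` column, and `alpha` are hypothetical names chosen for illustration, the column is assumed to be numeric, and whether `to_numpy()` avoids a copy depends on the dtype and the pandas version.

```python
import numpy as np
import pandas as pd

def sequential_ema(df, column, alpha):
    """Row-by-row exponential moving average (hypothetical example).

    Each step depends on the previous state, so the loop stays sequential.
    """
    # For a single numeric column this is usually a view, but that is not guaranteed.
    values = df[column].to_numpy()
    out = np.empty(len(values), dtype=np.float64)
    state = 0.0
    for i in range(len(values)):
        # Index the array directly; no per-row tuple is created.
        state = alpha * values[i] + (1.0 - alpha) * state
        out[i] = state
    return out

df = pd.DataFrame(
    {"price": [10.0, 20.0, 30.0]},
    index=pd.date_range("2020-01-01", periods=3),
)
ema = sequential_ema(df, "price", alpha=0.5)
```

Whether this reduces the memory growth seen in the pool workers would have to be measured on the real data.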

Thank you for your time.

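Update: I can currently work around this by reducing the number of processes (16 was a stress test) and setting maxtasksperchild on the Pool. I had tried that setting before, saw the CPU slow down (at startup, which makes sense now that I know memory is being copied), and scrapped it. Now the CPU usage looks normal.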

pool = mp.Pool(processes=NUM_PROCESSES, maxtasksperchild=1)

Similar behavior can also be achieved by running garbage collection after the loop completes, but since that is less ideal I am sticking with maxtasksperchild.
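The garbage-collection alternative mentioned above could look roughly like the sketch below. `process_rows` and its input are hypothetical stand-ins for the real worker function. Note that `gc.collect()` only reclaims objects caught in reference cycles; objects without cycles are already freed as soon as their reference count reaches zero.

```python
import gc

def process_rows(rows):
    """Hypothetical worker body: loop over rows, then force a collection."""
    total = 0.0
    for row in rows:
        total += row[1]
    # Run a full collection once the loop is done.
    # The return value is the number of unreachable objects that were found.
    unreachable = gc.collect()
    return total, unreachable

total, unreachable = process_rows([(0, 1.5), (1, 2.5)])
```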

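I still see fairly large memory consumption while the processes loop, but the system reclaims it when each process finishes. Obviously this workaround will eventually stop working as my data set and memory usage grow. So the question remains: is there a way to iterate over a DataFrame without copying memory?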
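One direction that might be explored for the open question, sketched here under assumptions rather than offered as a verified answer, is to place the numeric values in a `multiprocessing.shared_memory` block (Python 3.8+) and have each worker attach to it by name and index into it. The helper names `publish` and `running_total` are hypothetical, the data is assumed to fit a single float64 array, and the sketch runs in one process only to show the attach-and-iterate step.

```python
import numpy as np
from multiprocessing import shared_memory

def publish(values):
    """Copy the values once into a named shared-memory block."""
    src = np.asarray(values, dtype=np.float64)
    shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
    view = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
    view[:] = src  # the single copy into shared memory
    return shm, src.shape, src.dtype.str

def running_total(shm_name, shape, dtype):
    """Attach to the block by name and iterate sequentially over it."""
    shm = shared_memory.SharedMemory(name=shm_name)
    data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)  # wraps the buffer, no copy
    total = 0.0
    for i in range(data.shape[0]):
        total += float(data[i])
    del data      # drop the array before closing, or close() raises BufferError
    shm.close()
    return total

shm, shape, dtype = publish([1.0, 2.0, 3.0])
try:
    total = running_total(shm.name, shape, dtype)
finally:
    shm.close()
    shm.unlink()  # release the block once no process needs it
```

In a pool, only `shm.name`, `shape`, and `dtype` would need to be passed to each task; the date index would have to be shared separately, for example as an integer timestamp array.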

Can you iterate over a DataFrame without copying memory?

0 answers: