Question

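I have a function that I would like to compute in parallel using multiprocessing. The function takes one argument, but it also loads subsets from two very large dataframes that have already been loaded into memory (one is about 1 GB, the other just over 6 GB).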

import multiprocessing
import pandas as pd

largeDF1 = pd.read_csv(directory + 'name1.csv')
largeDF2 = pd.read_csv(directory + 'name2.csv')

def f(x):
    load_content1 = largeDF1.loc[largeDF1['FirstRow'] == x]
    load_content2 = largeDF2.loc[largeDF2['FirstRow'] == x]
    #some computation happens here
    new_data.to_csv(directory + 'output.csv', index = False)

def main():
    multiprocessing.set_start_method('spawn', force = True)
    pool = multiprocessing.Pool(processes = multiprocessing.cpu_count())
    input = input_data['col']
    pool.map_async(f, input)
    pool.close()
    pool.join()

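The problem is that the files are so large that I run into memory issues when running them on multiple cores. I would like to know whether there is a way to share the loaded files between all processes.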

I have tried manager(), but could not get it to work. Any help is appreciated. Thanks.

Answer 1

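If you are running this on a UNIX-like system (where fork is typically the default start method; macOS has defaulted to spawn since Python 3.8), the data is shared out of the box. Most operating systems use copy-on-write for memory pages. So even if you fork a process several times, the processes will share most of the memory pages that contain the dataframes, as long as you do not modify those dataframes.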
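A minimal sketch of the fork-based approach described above, assuming a UNIX-like system. The names `BIG_TABLE`, `lookup`, and `run_with_fork` are hypothetical, and a plain dict stands in for the dataframes so the example has no third-party dependencies. The key point is that the data is created at module level before the pool exists, so forked workers inherit it instead of reloading it.

```python
import multiprocessing

# Stand-in for the large dataframes (assumption): built once in the parent,
# before any worker is forked.
BIG_TABLE = {key: key * key for key in range(100_000)}


def lookup(key):
    """Read-only access to the inherited global; the table itself is never pickled."""
    return BIG_TABLE[key]


def run_with_fork(keys, processes=2):
    """Map lookup() over keys in a pool of forked workers (UNIX-like systems only)."""
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=processes) as pool:
        return pool.map(lookup, keys)
```

For instance, `run_with_fork([1, 2, 3])` should return `[1, 4, 9]`. Note that CPython's reference counting writes to object headers, so pages holding many small Python objects may still be copied when they are touched; the contiguous numeric buffers behind a dataframe carry no per-element reference counts and tend to stay shared.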

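When using the spawn start method, however, every worker process has to load the dataframes itself. In that case I am not sure whether the OS is smart enough to share the memory pages, or indeed whether the spawned processes would all have the same memory layout.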

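The only portable solution I can think of is to leave the data on disk and use mmap in the worker processes to map it into memory read-only. That way the OS will notice that multiple processes are mapping the same file, and it will load only one copy.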
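The idea above could look roughly like the following sketch. The names `init_worker`, `count_occurrences`, and `run` are hypothetical, and counting byte strings is only a placeholder for the real per-row computation. Each worker maps the file once in a pool initializer, so nothing large is pickled or passed between processes.

```python
import mmap
import multiprocessing

_mm = None  # per-process read-only mapping, set by init_worker()


def init_worker(path):
    """Map the file read-only, once per worker process."""
    global _mm
    with open(path, "rb") as fh:
        # access=ACCESS_READ is accepted on both UNIX and Windows.
        # The mapping stays valid after the file object is closed.
        _mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)


def count_occurrences(needle):
    """Count non-overlapping occurrences of needle (bytes) in the mapped file."""
    count, start = 0, 0
    while True:
        pos = _mm.find(needle, start)
        if pos == -1:
            return count
        count += 1
        start = pos + len(needle)


def run(path, needles, processes=2):
    """Start spawned workers that each map `path` themselves, then map over needles."""
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes, initializer=init_worker, initargs=(path,)) as pool:
        return pool.map(count_occurrences, needles)
```

Because spawn re-imports the main module in every worker, `run()` should be called from under an `if __name__ == "__main__":` guard.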

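The downside is that the data sits in memory in its on-disk CSV format, which makes reading data from it (without making a copy!) less convenient. So you might want to prepare the data beforehand into a form that is easier to use. For example, convert the data from 'FirstRow' into a binary file of float or double values that you can iterate over with struct.iter_unpack.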
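As a rough illustration of that preparation step, the sketch below writes a numeric column as packed little-endian doubles and later scans it through a read-only mapping with struct.iter_unpack. The function names `write_column` and `matching_rows` and the one-double-per-row layout are assumptions made for the example, not part of the original code.

```python
import mmap
import struct

RECORD = struct.Struct("<d")  # assumed layout: one little-endian double per row


def write_column(path, values):
    """One-time preparation: dump a numeric column as packed doubles."""
    with open(path, "wb") as fh:
        for value in values:
            fh.write(RECORD.pack(value))


def matching_rows(path, x):
    """Return the row indices whose stored value equals x.

    The file is scanned through a read-only mapping, so several processes
    doing this at once can share the same pages. An empty file cannot be
    mapped and would raise ValueError.
    """
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [
                row
                for row, (value,) in enumerate(struct.iter_unpack("<d", mm))
                if value == x
            ]
```

The returned indices could then be used to pick the matching rows out of the other columns, stored in the same row order.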

The function below (from my statusline script) uses mmap to count the number of messages in a mailbox file.

import mmap
import os


def mail(storage, mboxname):
    """
    Report unread mail.
    Arguments:
        storage: a dict with keys (unread, time, size) from the previous call or an empty dict.
            This dict will be *modified* by this function.
        mboxname (str): name of the mailbox to read.
    Returns: A string to display.
    """
    stats = os.stat(mboxname)
    if stats.st_size == 0:
        return 'Mail: 0'
    # When mutt modifies the mailbox, it seems to only change the
    # ctime, not the mtime! This is probably related to how mutt saves the
    # file. See also stat(2).
    newtime = stats.st_ctime
    newsize = stats.st_size
    if not storage or newtime > storage['time'] or newsize != storage['size']:
        with open(mboxname) as mbox:
            with mmap.mmap(mbox.fileno(), 0, prot=mmap.PROT_READ) as mm:
                start, total = 0, 1  # First mail is not found; it starts on first line...
                while True:
                    rv = mm.find(b'\n\nFrom ', start)
                    if rv == -1:
                        break
                    else:
                        total += 1
                        start = rv + 7
                start, read = 0, 0
                while True:
                    rv = mm.find(b'\nStatus: R', start)
                    if rv == -1:
                        break
                    else:
                        read += 1
                        start = rv + 10
        unread = total - read
        # Save values for the next run.
        storage['unread'], storage['time'], storage['size'] = unread, newtime, newsize
    else:
        unread = storage['unread']
    return f'Mail: {unread}'

In this case I used mmap because it was 4x faster than just reading the file. See normal reading versus using mmap.

Sharing large dataframes with multiprocessing.pool
