I want to back a relatively large Pandas DataFrame with an ndarray from a memmap (in shared memory). I have code for this (below), but when I run a calculation on the DataFrame, overall system memory usage (as measured by top) rises, as if the process were copying the data. When the calculation finishes, system memory usage returns to baseline. If I run the same calculation directly on the memmap, system memory usage does not increase. Is there a way to avoid this (apparently) temporary spike in memory usage?
(Note that in both cases the memory percentage used by the individual python process, as reported by top, goes up, FWIW.)
Using Pandas 0.20.3, numpy 1.13.1, python 2.7.11.
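(For what it's worth, instead of eyeballing top I could also log the process's peak RSS from inside the scripts with a small helper like the sketch below; this is just a diagnostic aside, assuming Linux, where ru_maxrss is reported in kilobytes.)

import resource

def print_peak_rss(label):
    # peak resident set size of this process; on Linux ru_maxrss is in KB
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print "{}: peak RSS ~{} MB".format(label, peak_kb // 1024)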
Code - the first script, example_setup.py, sets up the shared-memory memmap:
import numpy

N = 7300000000  # this large N makes it really obvious in top what is happening
memmap_file = "/tmp/hello_world.bin"
progress_mod = 10000000
print N / progress_mod

if __name__ == "__main__":
    print "opening memmap_file: {}".format(memmap_file)
    my_mm = numpy.memmap(memmap_file, dtype="float32", mode="w+", shape=(N,))
    print "writing integers to memmap file - N: {}".format(N)
    for i in xrange(N):
        my_mm[i] = float(i)
        if (i % progress_mod) == 0:
            print "progress i: {}".format(i)
    raw_input("pause here to allow other processes to use shared memory")
The second script, example_use.py, uses the memmap above directly and then as the backing for a Pandas DataFrame:
import example_setup
import numpy
import pandas

if __name__ == "__main__":
    memmap_file = example_setup.memmap_file
    N = example_setup.N
    print "opening memmap_file: {}".format(memmap_file)
    my_mm = numpy.memmap(memmap_file, dtype="float32", mode="r", shape=(N,))
    print "calculate mean of my_mm, monitoring memory with top. This process will show increased usage, but system usage will not increase"
    my_mean = my_mm.mean()
    print "my_mean: {}".format(my_mean)
    raw_input("pause here before doing the above with a DataFrame backed by my_mm")
    df = pandas.DataFrame(my_mm, copy=False)
    print """calculate mean of pandas DataFrame df, monitoring memory with top. Both this process and system usage will increase.
When the calculation finishes, system usage will return to baseline"""
    my_df_mean = df.mean()
    print "my_df_mean: {}".format(my_df_mean)
    raw_input("pause here before exiting")