Parquet to_pandas memory leak

Date: 2017-05-13 23:03:22

Tags: python pandas parquet

When I read a Parquet file from disk and convert it to a pandas DataFrame, the allocated memory does not appear to be released once the object goes out of scope. Any idea why this happens? The snippet below reproduces the issue:

import resource
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


# write df to parquet
def write_parquet(path):
    df = pd.DataFrame({str(i): np.random.randn(10000000) for i in range(10)})
    table = pa.Table.from_pandas(df)
    pq.write_table(table, path)
    df = None


# read parquet back into pandas; the DataFrame goes out of scope when the function returns
def read_parquet(path):
    pq.read_table(path, nthreads=1).to_pandas()


if __name__ == "__main__":
    write_parquet("test.parquet")
    for i in range(5):  # read in a loop
        read_parquet("test.parquet")
        print((f"Iteration {i}|MaximumResidentSetSize: "
               f"{resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000}"))


$ python replicate.py
Iteration 0|MaximumResidentSetSize: 1653.36
Iteration 1|MaximumResidentSetSize: 2434.932
Iteration 2|MaximumResidentSetSize: 3216.204
Iteration 3|MaximumResidentSetSize: 3997.276
Iteration 4|MaximumResidentSetSize: 4778.736

This happens on Linux with pyarrow version 0.3.0, installed via conda (conda-forge build).
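One way to narrow down where the memory is held would be to force a garbage-collection pass after each read and compare the process max RSS against what pyarrow's default memory pool reports as allocated. Below is a minimal diagnostic sketch of that idea, assuming pyarrow.total_allocated_bytes() is available in the installed version (it may not be exposed, or may behave differently, on a release as old as 0.3.0) and that test.parquet has already been written as above:

import gc
import resource

import pyarrow as pa
import pyarrow.parquet as pq


def read_and_measure(path, i):
    # convert to pandas and let the result go out of scope immediately
    pq.read_table(path, nthreads=1).to_pandas()
    gc.collect()  # rule out objects that are merely waiting for collection
    # bytes still held by pyarrow's default memory pool (assumed API)
    pool_bytes = pa.total_allocated_bytes()
    max_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000
    print(f"Iteration {i}|PoolBytes: {pool_bytes}|MaximumResidentSetSize: {max_rss}")


if __name__ == "__main__":
    for i in range(5):
        read_and_measure("test.parquet", i)

If the pool counter returns to roughly zero between iterations while the max RSS keeps climbing, the bytes have been handed back to the allocator but not returned to the operating system; if the pool counter itself keeps growing, something is still holding references inside pyarrow.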

0 Answers:

There are no answers yet.