When reading a Parquet file from disk and converting it to a pandas DataFrame, the allocated memory does not appear to be released once the objects go out of scope. Any idea why this happens? The snippet below reproduces the issue:
import resource
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# write df to parquet
def write_parquet(path):
    df = pd.DataFrame({str(i): np.random.randn(10000000) for i in range(10)})
    table = pa.Table.from_pandas(df)
    pq.write_table(table, path)
    df = None

def read_parquet(path):
    pq.read_table(path, nthreads=1).to_pandas()

if __name__ == "__main__":
    write_parquet("test.parquet")

    for i in range(5):  # read in a loop
        read_parquet("test.parquet")
        print((f"Iteration {i}|MaximumResidentSetSize: "
               f"{resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000}"))
$ python replicate.py
Iteration 0|MaximumResidentSetSize: 1653.36
Iteration 1|MaximumResidentSetSize: 2434.932
Iteration 2|MaximumResidentSetSize: 3216.204
Iteration 3|MaximumResidentSetSize: 3997.276
Iteration 4|MaximumResidentSetSize: 4778.736
This happens on Linux, with pyarrow 0.3.0 installed via conda (conda-forge build).
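For reference, ru_maxrss is a high-water mark and can never decrease, so it only shows that the peak keeps growing, not where the memory is held. A small diagnostic along these lines (a sketch, assuming a Linux /proc filesystem and that pa.total_allocated_bytes() is available in this pyarrow build) would measure the current RSS after each iteration and report how many bytes Arrow's default memory pool still holds once a garbage collection has been forced, which helps separate lingering Python references from allocator behaviour:

import gc
import os
import pyarrow as pa
import pyarrow.parquet as pq

def current_rss_mb():
    # /proc/self/statm reports sizes in pages; the second field is the resident set.
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / 1e6

if __name__ == "__main__":
    for i in range(5):
        pq.read_table("test.parquet").to_pandas()
        gc.collect()  # rule out objects kept alive by uncollected reference cycles
        print(f"Iteration {i}|CurrentRSS: {current_rss_mb():.1f} MB|"
              f"ArrowPool: {pa.total_allocated_bytes() / 1e6:.1f} MB")

If CurrentRSS keeps climbing while ArrowPool stays near zero, the memory is being freed by Arrow but retained by the process-level allocator rather than referenced from Python.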