Question

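I am using pandas and HDF5 files to handle large amounts of data (e.g. 10 GB or more). I would like to use the table format so that I can query the data efficiently when reading it. However, when I write my data to the HDF store (using DataFrame.to_hdf()), it incurs a huge memory overhead. Consider the following example: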

import pandas as pd
import numpy as np
from random import sample

nrows   = 1000000
ncols   = 500

# create a big dataframe
df = pd.DataFrame(np.random.rand(nrows,ncols)) 

# 1) Desired table format: uses huge memory overhead
df.to_hdf('test.hdf', 'random', format='table') # use lots of additional memory

# 2) Fixed format,  uses no additional memory
df.to_hdf('test2.hdf', 'random')

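When I call df.info(), I see that the DataFrame occupies 3.7 GB. When executing version 1 (table format), the memory usage of my system suddenly increases by about 8.5 GB, which is more than twice the size of my DataFrame. Version 2 (fixed format), on the other hand, has no additional memory overhead.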
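As a quick sanity check on that figure, the raw size of the frame can be estimated from its shape alone. The sketch below assumes every value is a float64 (8 bytes) and ignores the index and any per-object overhead, so it is only an approximation of what df.info() reports:

```python
# Back-of-the-envelope size of the frame from the example above.
nrows, ncols = 1_000_000, 500
bytes_per_value = 8  # float64

total_bytes = nrows * ncols * bytes_per_value
total_gib = total_bytes / 2**30  # binary gigabytes

print(f"{total_bytes:,} bytes = {total_gib:.2f} GiB")
```

That is roughly 3.7 GiB for the data alone, so an extra 8.5 GB during the write corresponds to more than two additional copies of the frame.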

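Does anybody know what the problem is, or how I can prevent this huge memory overhead? With larger DataFrames of around 10 GB, I always run out of memory because of this overhead. I know that the table format does not perform as well in terms of speed and in other respects, but I don't see why it should need so much additional memory (enough for more than two full copies of the DataFrame).

It would be great if anyone has an explanation and/or a solution for this problem.

Thanks, Markus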

Answer 1

In <= 0.15.2

In [1]: df = pd.DataFrame(np.random.rand(1000000,500))

In [2]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Columns: 500 entries, 0 to 499
dtypes: float64(500)
memory usage: 3.7 GB
In [3]: %memit -r 1 df.to_hdf('test.h5','df',format='table',mode='w')
peak memory: 11029.49 MiB, increment: 7130.57 MiB

There is an inefficiency in <= 0.15.2 that ends up copying the data 1-2 times. You can do this as a workaround until 0.16.0 (where this is fixed):

In [9]: %cpaste
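def f(df):
  # integer division puts 10000 consecutive rows in each group
  g = df.groupby(np.arange(len(df))//10000)
  store = pd.HDFStore('test.h5',mode='w')
  for _, grp in g:
    # index=False skips index creation on every append
    store.append('df',grp, index=False)
  # build the index once, after all chunks are written
  store.get_storer('df').create_index()
  store.close()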

In [11]: %memit -r 1 f(df)
peak memory: 7977.26 MiB, increment: 4079.32 MiB
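The same chunked-append idea can be sketched without groupby, by slicing the frame positionally. This is only an illustration: the helper names iter_row_chunks and write_table_in_chunks and the default chunksize are assumptions made for this example, not part of pandas, and no memory figures were measured for it.

```python
import pandas as pd


def iter_row_chunks(nrows, chunksize):
    """Yield (start, stop) row bounds that cover range(nrows) in order."""
    for start in range(0, nrows, chunksize):
        yield start, min(start + chunksize, nrows)


def write_table_in_chunks(df, path, key, chunksize=10000):
    """Append df to a table-format HDF5 store one row slice at a time."""
    # The context manager closes the file even if an append fails.
    with pd.HDFStore(path, mode='w') as store:
        for start, stop in iter_row_chunks(len(df), chunksize):
            # index=False defers building the table index on each append.
            store.append(key, df.iloc[start:stop], index=False)
        # Build the index once, after all rows are written.
        store.get_storer(key).create_index()
```

With the frame from the question, this would be called as write_table_in_chunks(df, 'test.h5', 'df'). Only one slice of the frame is handed to the store at a time; the chunk size trades peak memory against the number of append calls.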

In 0.16.0 (i.e. the 3rd week of March 2015), after this PR

In [2]: %memit -r 1 df.to_hdf('test.h5','df',format='table',mode='w')
peak memory: 4669.21 MiB, increment: 794.57 MiB
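To put the three %memit increments side by side, they can be expressed as multiples of the frame's raw data size. The snippet below uses only the numbers quoted in this answer and assumes the frame holds 1000000 x 500 float64 values:

```python
frame_mib = 1_000_000 * 500 * 8 / 2**20  # float64 data only, in MiB

# Memory increments reported by %memit above, in MiB.
increments_mib = {
    "to_hdf table, <= 0.15.2": 7130.57,
    "chunked append, <= 0.15.2": 4079.32,
    "to_hdf table, 0.16.0": 794.57,
}

ratios = {name: inc / frame_mib for name, inc in increments_mib.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the frame size")
```

This works out to roughly 1.9, 1.1, and 0.2 extra copies of the frame, respectively.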

Pandas / PyTables memory overhead when writing to HDF
