I have a pandas DataFrame that takes up about 4 MB of memory:
>>> df.memory_usage(deep=True).sum() / (1024 ** 2)
3.9931907653808594
When I save this DataFrame to an HDF5 file, it takes up 21.1 MB on disk:
>>> df.to_hdf('my_data.hdf5', key='df', format='table')
Can someone explain why the file on disk is about five times larger than the DataFrame in memory?
df.info()
returns:
<class 'pandas.core.frame.DataFrame'>
Float64Index: 1440 entries, 1262300460.0 to 1262386800.0
Columns: 879 entries,
dtypes: category(212), float32(667)
memory usage: 4.0 MB
df.dtypes.unique()
returns:
array([dtype('float32'),
CategoricalDtype(categories=['Closing', 'Off'], ordered=False),
CategoricalDtype(categories=['Active'], ordered=False),
CategoricalDtype(categories=['-'], ordered=False),
CategoricalDtype(categories=[], ordered=False),
CategoricalDtype(categories=['Active', 'Off'], ordered=False),
CategoricalDtype(categories=['Low', 'Off'], ordered=False),
CategoricalDtype(categories=['Low'], ordered=False)], dtype=object)
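As a rough sanity check on the in-memory figure, the arithmetic can be sketched as follows. It assumes pandas stores each categorical column as int8 codes (1 byte per value, which is the case when a column has fewer than 128 categories) and the Float64Index as 8 bytes per entry. The category labels and per-object overhead are not modelled, which would account for the small gap to the reported 3.99.

```python
# Shape and dtype counts taken from the df.info() output above.
n_rows = 1440
n_float32_cols = 667
n_category_cols = 212

float_bytes = n_rows * n_float32_cols * 4      # float32: 4 bytes per value
category_bytes = n_rows * n_category_cols * 1  # assumed int8 codes: 1 byte per value
index_bytes = n_rows * 8                       # float64 index: 8 bytes per entry

total_bytes = float_bytes + category_bytes + index_bytes
total_mib = total_bytes / 1024 ** 2
print(f"{total_mib:.2f} MiB")  # prints 3.97 MiB
```

So the raw values alone explain almost all of the 4 MB in memory; the extra size on disk has to come from how the data is laid out in the HDF5 file, not from the data itself.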
Edit 7.24.2018, in reply to @jpp:
This is how I converted the categorical columns to an integer dtype:
for col in df.columns[df.dtypes == 'category']:
    df[col] = df[col].cat.codes
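A minimal, self-contained sketch of the same conversion on a small synthetic frame, in case it helps to reproduce the issue. The column names and values here are made up for illustration; the only addition to the loop above is that the category labels are kept in a dictionary (a hypothetical helper, not part of the original code) so the integer codes can be mapped back later.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the real data (hypothetical columns).
df = pd.DataFrame({
    "value": np.array([0.5, 1.5, 2.5], dtype="float32"),
    "state": pd.Categorical(["Active", "Off", "Active"],
                            categories=["Active", "Off"]),
    "level": pd.Categorical(["Low", "Low", "Low"], categories=["Low"]),
})

# Pick out the categorical columns by dtype.
cat_cols = df.select_dtypes(include="category").columns

# Keep the labels so the integer codes can be decoded again later.
categories = {col: df[col].cat.categories for col in cat_cols}

# Replace each categorical column with its integer codes.
for col in cat_cols:
    df[col] = df[col].cat.codes

# Decoding one column back to its labels from the saved categories.
restored = pd.Categorical.from_codes(df["state"].to_numpy(),
                                     categories=categories["state"])
```

After the loop, `df` holds only numeric columns, and `restored` shows that the original labels are recoverable as long as the `categories` mapping is stored somewhere alongside the data.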