I have a pandas DataFrame with roughly 400k rows and 5000 columns. All of the data is float64 and very sparse (most cells are 0).
The columns are also multi-indexed.
I'm trying to save this dataframe to HDF5 like so:
import pandas as pd

# Obtain the 400k x 5000 DF (df1 and df2 are built earlier in the script)
df = pd.concat([df1, df2], axis=1, join='inner')
store = pd.HDFStore("store.h5")
store['ComboDF'] = df
I'm getting a MemoryError (using 64-bit Python 3.5 with 32 GB of RAM; I've seen the error both in the Spyder console and when running the script standalone).
The MemoryError looks like this:
> C:\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\start_ipython_kernel.py:1:
> PerformanceWarning: your performance may suffer as PyTables will
> pickle object types that it cannot map directly to c-types
> [inferred_type->mixed,key->axis0] [items->None]
>
> # -*- coding: utf-8 -*- Traceback (most recent call last):
>
> File "<ipython-input-17-8e4075345173>", line 1, in <module>
> store['Combined'] = df
>
> File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line
> 420, in __setitem__
> self.put(key, value)
>
> File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line
> 826, in put
> self._write_to_group(key, value, append=append, **kwargs)
>
> File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line
> 1264, in _write_to_group
> s.write(obj=value, append=append, complib=complib, **kwargs)
>
> File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line
> 2812, in write
> self.write_array('block%d_values' % i, blk.values, items=blk_items)
>
> File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line
> 2592, in write_array
> self._handle.create_array(self.group, key, value)
>
> File "C:\Anaconda3\lib\site-packages\tables\file.py", line 1152, in
> create_array
> obj=obj, title=title, byteorder=byteorder)
>
> File "C:\Anaconda3\lib\site-packages\tables\array.py", line 188, in
> __init__
> byteorder, _log)
>
> File "C:\Anaconda3\lib\site-packages\tables\leaf.py", line 262, in
> __init__
> super(Leaf, self).__init__(parentnode, name, _log)
>
> File "C:\Anaconda3\lib\site-packages\tables\node.py", line 267, in
> __init__
> self._v_objectid = self._g_create()
>
> File "C:\Anaconda3\lib\site-packages\tables\array.py", line 197, in
> _g_create
> nparr = array_as_internal(self._obj, flavor)
>
> File "C:\Anaconda3\lib\site-packages\tables\flavor.py", line 178, in
> array_as_internal
> return array_of_flavor2(array, src_flavor, internal_flavor)
>
> File "C:\Anaconda3\lib\site-packages\tables\flavor.py", line 131, in
> array_of_flavor2
> return convfunc(array)
>
> File "C:\Anaconda3\lib\site-packages\tables\flavor.py", line 371, in
> conv_to_numpy
> nparr = nparr.copy() # copying the array makes it contiguous
>
> MemoryError
I also tried taking a sample of my dataset and ran into the following:
sample = df.sample(100)
Traceback (most recent call last):
File "<ipython-input-23-0c332df5fd61>", line 1, in <module>
df.sample(n=10)
File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2573, in sample
return self.take(locs, axis=axis, is_copy=False)
File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1630, in take
self._consolidate_inplace()
File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2729, in _consolidate_inplace
self._protect_consolidate(f)
File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2718, in _protect_consolidate
result = f()
File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2727, in f
self._data = self._data.consolidate()
File "C:\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3273, in consolidate
bm._consolidate_inplace()
File "C:\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3278, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "C:\Anaconda3\lib\site-packages\pandas\core\internals.py", line 4269, in _consolidate
_can_consolidate=_can_consolidate)
File "C:\Anaconda3\lib\site-packages\pandas\core\internals.py", line 4289, in _merge_blocks
new_values = _vstack([b.values for b in blocks], dtype)
File "C:\Anaconda3\lib\site-packages\pandas\core\internals.py", line 4335, in _vstack
return np.vstack(to_stack)
File "C:\Anaconda3\lib\site-packages\numpy\core\shape_base.py", line 230, in vstack
return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
MemoryError
I'm obviously doing something dumb, but I can't figure out what it might be. Can any pandas/HDF5 experts offer advice?
Edit: If I select just a subset of the columns (e.g. 100 of them), then everything works fine.
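A column-chunked write based on that observation might look like this (a rough sketch; the chunk size and key names are purely illustrative):

store = pd.HDFStore("store.h5")
# write 100-column slices under separate keys, so no single array has to be
# copied into one giant contiguous block
for i in range(0, df.shape[1], 100):
    store['ComboDF_part%d' % i] = df.iloc[:, i:i + 100]
store.close()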
Answer 0 (score: 1)
You can try the following:

prefix = 'c'  # any short string; the point is to make every column label a string
df.columns = list(map(lambda x: prefix + str(x), df.columns))

and then try saving the file again:
store = pd.HDFStore('file.h5')
store['mydf'] = df
From the pandas docs:

> RangeIndex has been added as a subclass of Int64Index to support a memory saving alternative for common use cases. This is similar in implementation to the python range object (xrange in python 2), in that it only stores the start, stop, and step values for the index. It will transparently interact with the user API, converting to Int64Index if needed.
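A quick illustration of what that means (a minimal sketch; the size is arbitrary):

import pandas as pd

idx = pd.RangeIndex(5000)   # stores only start/stop/step, not 5000 integers
print(idx)                  # RangeIndex(start=0, stop=5000, step=1)
print(idx[:3].tolist())     # [0, 1, 2] -- behaves like a normal index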
IIRC, HDF5 has a 64kb limit on its total header metadata. PyTables may be forced to store the range-index column labels as 64-bit integers, which is clearly overkill, and with this many columns plus the other metadata the format has to store, that may be what causes the error. I had a similar problem with 4096 columns, which I converted to simple short strings.
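Since the columns in the question are a MultiIndex, a variant of the renaming above would flatten each label tuple into one short string first (a sketch; the '_' separator is an assumption):

# flatten MultiIndex column tuples like ('a', 1) into strings like 'a_1'
df.columns = ['_'.join(str(level) for level in col) for col in df.columns]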
Answer 1 (score: 0)
Here's what I'd want to try (I can't test it at the moment): stack the dataframe into long format and save that instead. Then, when you read it back into pandas, you can set the index, unstack, and fillna(0) to recover all of your zeros:

data = (
    load_hdf5_data("store.h5")['ComboDF']  # or however you do this
    .set_index(index_columns)
    .unstack(level='columns')
    .fillna(0)
)
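The write side isn't shown above, but a sketch of it might look like this (assuming the goal is to store only the non-zero entries; df is the frame from the question, and the key and column names are illustrative):

# stack every column level into the row index, giving one long Series
long_s = df.stack(level=list(range(df.columns.nlevels)))
long_s = long_s[long_s != 0]  # drop the zeros -- the bulk of a sparse frame
# store the much smaller long-format table
long_s.to_frame('value').reset_index().to_hdf('store.h5', 'ComboDF', format='table')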