I have a DataFrame in Pandas:
In [7]: my_df
Out[7]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34 entries, 0 to 0
Columns: 2661 entries, airplane to zoo
dtypes: float64(2659), object(2)
When I try to save it to disk:
store = pd.HDFStore(p_full_h5)
store.append('my_df', my_df)
I get:
File "H5A.c", line 254, in H5Acreate2
unable to create attribute
File "H5A.c", line 503, in H5A_create
unable to create attribute in object header
File "H5Oattribute.c", line 347, in H5O_attr_create
unable to create new attribute in header
File "H5Omessage.c", line 224, in H5O_msg_append_real
unable to create new message
File "H5Omessage.c", line 1945, in H5O_msg_alloc
unable to allocate space for message
File "H5Oalloc.c", line 1142, in H5O_alloc
object header message is too large
End of HDF5 error back trace
Can't set attribute 'non_index_axes' in node:
/my_df(Group) u''.
Why?
Note: in case it matters, the DataFrame column names are simple small strings:
In [12]: max([len(x) for x in list(my_df.columns)])
Out[12]: 47
This is with Pandas 0.11 and the latest stable versions of IPython, Python and HDF5.
Answer 0 (score: 11)
HDF5 has a header limit of 64 KB for all of the columns' metadata. This includes the names, types, and so on. When you get to roughly 2000 columns, you run out of space to store all of the metadata. This is a fundamental limitation of PyTables, and I don't think they will implement a workaround on their side any time soon. You will either have to split the table up or choose another storage format.
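For illustration, here is a minimal sketch of the splitting workaround; the frame, file name, key names and chunk size are made up for this example and not part of the original answer.

import numpy as np
import pandas as pd

# Hypothetical wide frame, similar in shape to the one in the question.
wide = pd.DataFrame(np.random.rand(34, 2661),
                    columns=['col{}'.format(i) for i in range(2661)])

# Split the columns across several 'table'-format keys so that each key
# stays well below the ~2000-column metadata limit.
chunk = 1000
with pd.HDFStore('my_df_split.h5', mode='w') as store:
    for num, start in enumerate(range(0, wide.shape[1], chunk)):
        store.put('my_df_part{}'.format(num),
                  wide.iloc[:, start:start + chunk], format='table')

# Reassemble by concatenating the parts back along the column axis.
with pd.HDFStore('my_df_split.h5') as store:
    parts = [store.select(key) for key in store.keys()]
restored = pd.concat(parts, axis=1)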
Answer 1 (score: 5)
Although this thread is more than 5 years old, the problem still persists: it is still not possible to save a DataFrame with more than roughly 2000 columns into an HDFStore as a single table. Using format='fixed' is not an option if you want to choose later which columns to read from the HDFStore.
Here is a function that splits the DataFrame into smaller ones and stores them as separate tables. In addition, a pandas.Series is put into the HDFStore that records which table each column belongs to.
def wideDf_to_hdf(filename, data, columns=None, maxColSize=2000, **kwargs):
    """Write a `pandas.DataFrame` with a large number of columns
    to one HDFStore.

    Parameters
    ----------
    filename : str
        name of the HDFStore
    data : pandas.DataFrame
        data to save in the HDFStore
    columns : list
        a list of columns for storing. If set to `None`, all
        columns are saved.
    maxColSize : int (default=2000)
        this number defines the maximum possible column size of
        a table in the HDFStore.
    """
    import numpy as np
    import pandas as pd
    from collections import ChainMap
    store = pd.HDFStore(filename, **kwargs)
    if columns is None:
        columns = data.columns
    colSize = len(columns)
    if colSize > maxColSize:
        # split the columns into chunks of at most maxColSize entries
        numOfSplits = np.ceil(colSize / maxColSize).astype(int)
        colsSplit = [
            columns[i * maxColSize:(i + 1) * maxColSize]
            for i in range(numOfSplits)
        ]
        # map every column name to the table ('data0', 'data1', ...) it lives in
        _colsTabNum = ChainMap(*[
            dict(zip(cols, ['data{}'.format(num)] * len(cols)))
            for num, cols in enumerate(colsSplit)
        ])
        colsTabNum = pd.Series(dict(_colsTabNum)).sort_index()
        for num, cols in enumerate(colsSplit):
            store.put('data{}'.format(num), data[cols], format='table')
        store.put('colsTabNum', colsTabNum, format='fixed')
    else:
        store.put('data', data[columns], format='table')
    store.close()
DataFrames stored in an HDFStore with the function above can be read back with the following function.
def read_hdf_wideDf(filename, columns=None, **kwargs):
    """Read a `pandas.DataFrame` from an HDFStore.

    Parameters
    ----------
    filename : str
        name of the HDFStore
    columns : list
        the columns in this list are loaded. Load all columns,
        if set to `None`.

    Returns
    -------
    data : pandas.DataFrame
        loaded data.
    """
    import pandas as pd
    store = pd.HDFStore(filename)
    data = []
    # 'colsTabNum' is only present if the frame was split across several tables
    colsTabNum = store.select('colsTabNum') if 'colsTabNum' in store else None
    if colsTabNum is not None:
        if columns is not None:
            # invert the column -> table mapping so that we know, per table,
            # which of the requested columns to pull out of it
            tabNums = pd.Series(
                index=colsTabNum[columns].values,
                data=colsTabNum[columns].index).sort_index()
            for table in tabNums.index.unique():
                data.append(
                    store.select(table, columns=list(tabNums.loc[[table]]),
                                 **kwargs))
        else:
            for table in colsTabNum.unique():
                data.append(store.select(table, **kwargs))
        data = pd.concat(data, axis=1).sort_index(axis=1)
    else:
        data = store.select('data', columns=columns)
    store.close()
    return data
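A short usage sketch of the two helpers above; the frame, column names and file name are made up for illustration.

import numpy as np
import pandas as pd

# Hypothetical frame with more columns than fit into a single 'table' key.
df = pd.DataFrame(np.random.rand(50, 4500),
                  columns=['c{}'.format(i) for i in range(4500)])

# Write: the frame ends up as data0, data1, data2 plus the colsTabNum series.
wideDf_to_hdf('wide.h5', df, maxColSize=2000, mode='w')

# Read everything back, or only a chosen subset of columns.
full = read_hdf_wideDf('wide.h5')
subset = read_hdf_wideDf('wide.h5', columns=['c10', 'c2500', 'c4000'])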
Answer 2 (score: 4)
As of 2014, HDF has been updated:
If you are using HDF5 1.8.0 or previous releases, there is a limit on the number of fields you can have in a compound datatype. This is due to the 64K limit on object header messages, into which datatypes are encoded. (However, you can create a lot of fields before it will fail. One user was able to create up to 1260 fields in a compound datatype before it failed.)
As for pandas itself, it can save a DataFrame with any number of columns using the format='fixed' option; format='table' still raises the same error as in the question.
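For illustration, a minimal sketch of the format='fixed' route, reusing the my_df from the question (the file name is made up); it avoids the header limit but gives up appending and column selection on read.

# 'fixed' stores the frame as whole blocks, so the 64 KB per-table header limit
# on column metadata does not apply; the trade-off is that you cannot append to
# the key or read back only a subset of columns later.
store = pd.HDFStore('my_df_fixed.h5', mode='w')
store.put('my_df', my_df, format='fixed')
store.close()

# Reading always returns the full frame:
my_df_back = pd.read_hdf('my_df_fixed.h5', 'my_df')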
I have also tried h5py and got the 'header is too large' error as well (although my version is > 1.8.0).
Answer 3 (score: 1)
### USE get_weights AND set_weights TO SAVE AND LOAD THE MODEL, RESPECTIVELY.
##############################################################################
import pickle
from keras.models import Sequential
from keras.layers import Conv2D, Activation, Flatten, Dense

# Assuming that this is your model architecture. However, you may use
# whatever architecture you want (big or small; any).
def mymodel():
    inputShape = (28, 28, 3)
    model = Sequential()
    model.add(Conv2D(20, 5, padding="same", input_shape=inputShape))
    model.add(Activation('relu'))
    model.add(Flatten())
    model.add(Dense(500))
    model.add(Activation('relu'))
    model.add(Dense(2, activation="softmax"))
    return model

model = mymodel()
model.fit(....)  # parameters to start training your model
################################################################################
################################################################################
# once your model has been trained, you want to save it on your PC
# use get_weights() to get your model weights
weigh = model.get_weights()

# now, use pickle to save your model weights, instead of .h5
# for heavy model architectures, the .h5 file is unsupported
pklfile = "D:/modelweights.pkl"
try:
    fpkl = open(pklfile, 'wb')  # Python 3
    pickle.dump(weigh, fpkl, protocol=pickle.HIGHEST_PROTOCOL)
    fpkl.close()
except:
    fpkl = open(pklfile, 'w')  # Python 2
    pickle.dump(weigh, fpkl, protocol=pickle.HIGHEST_PROTOCOL)
    fpkl.close()
################################################################################
################################################################################
# in the future, you may want to load your model back
# use pickle to load the model weights
pklfile = "D:/modelweights.pkl"
try:
    f = open(pklfile)  # Python 2
    weigh = pickle.load(f)
    f.close()
except:
    f = open(pklfile, 'rb')  # Python 3
    weigh = pickle.load(f)
    f.close()

restoredmodel = mymodel()
# use set_weights to load the model weights into the model architecture
restoredmodel.set_weights(weigh)
################################################################################
################################################################################
# now, you can do your testing and evaluation - predictions
y_pred = restoredmodel.predict(X)