Question

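What is the difference between an h5 file and an hdf file? Should I use one over the other? I ran timeit on the two snippets below; each loop took about 3 minutes 29 seconds with a 240 MB file. The second snippet eventually raised an error, but the file on disk was over 300 MB.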

hdf = pd.HDFStore('combined.h5')
hdf.put('table', df, format='table', complib='blosc', complevel=5, data_columns=True)
hdf.close()


df.to_hdf('combined.hdf', 'table', format='table', mode='w', complib='blosc', complevel=5)
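
A note on the two snippets: as far as I can tell, the `.h5` and `.hdf` extensions are only naming conventions for the same HDF5 container, so the extension alone should not change speed or size. The substantive difference between the two calls is `data_columns=True` in the first one, which makes every column a queryable data column and would generally be expected to cost extra write time and disk space. The sketch below is a minimal check under those assumptions. The helper name `write_with_both_extensions` and the toy frame are hypothetical and not from the question, and it needs PyTables installed.

```python
import os
import tempfile

import pandas as pd


def write_with_both_extensions(df, directory):
    """Write df with identical settings to a .h5 and a .hdf path and read both back."""
    results = {}
    for name in ("combined.h5", "combined.hdf"):
        path = os.path.join(directory, name)
        # Identical call for both paths; only the file name differs.
        df.to_hdf(path, key="table", format="table", mode="w",
                  complib="blosc", complevel=5)
        results[name] = pd.read_hdf(path, key="table")
    return results


if __name__ == "__main__":
    # Toy data standing in for the real frame (assumption for illustration).
    demo = pd.DataFrame({"colz": [1, 2, 3], "colzz": [4, 5, 6]})
    with tempfile.TemporaryDirectory() as tmp:
        out = write_with_both_extensions(demo, tmp)
        # Compare what comes back from the two files.
        print(out["combined.h5"].equals(out["combined.hdf"]))
```

For a fair timing comparison, both calls would need the same `data_columns` setting.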

Additionally, I get this warning message:

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types

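This happens because the string columns have dtype object due to their blank values. If I call .astype(str), all of the blanks are replaced with nan (a literal string that even shows up in the output file). Should I worry about the warning, fill in the blanks, and replace them with np.nan again later, or just ignore it?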
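
The fill-then-restore idea described above can be sketched as follows. This is only an illustration under assumptions: the toy frame and the `text_cols` list are hypothetical, the empty string is an arbitrary placeholder, and the round trip treats genuinely empty strings as missing. It does not by itself show whether the PyTables warning goes away.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (assumption for illustration).
df = pd.DataFrame({
    "cold": pd.Series(["x", np.nan, "y"], dtype=object),
    "colz": [1, 2, 3],
})

# Hypothetical list of the text columns that contain blanks.
text_cols = ["cold"]

# Before writing: replace missing values with an empty-string placeholder,
# so the column holds only real strings instead of a str/float mix.
filled = df.copy()
filled[text_cols] = df[text_cols].fillna("")

# After reading back: turn the placeholder into NaN again.
restored = filled.copy()
restored[text_cols] = filled[text_cols].mask(filled[text_cols] == "")

print(filled["cold"].tolist())
print(restored["cold"].isna().tolist())
```

Filling with `""` rather than calling `.astype(str)` keeps the literal text `nan` out of the data.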

Here is df.info(), which shows that some columns have nulls. I can't drop these rows, but I could fill them in temporarily if needed.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1387276 entries, 0 to 657406
Data columns (total 12 columns):
date                          1387276 non-null datetime64[ns]
start_time                    1387276 non-null datetime64[ns]
end_time                      313190 non-null datetime64[ns]
cola                          1387276 non-null object
colb                          1387276 non-null object
colc                          1387276 non-null object
cold                          476816 non-null object
cole                          1228781 non-null object
colx                          1185679 non-null object
coly                          313190 non-null object
colz                          1387276 non-null int64
colzz                         1387276 non-null int64
dtypes: datetime64[ns](3), int64(2), object(7)
memory usage: 137.6+ MB

Optimize writing a DataFrame to HDF?

0 answers: