Question

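I have a 2 GB DataFrame that is written once and read many times. I want to use it in pandas, so I have been using df.to_hdf and pd.read_hdf with the fixed format, which works well for both reading and writing.

However, the DataFrame keeps growing as more columns are added, so I would like to switch to the table format, which lets me select only the columns I need when reading. I expected this to give a speed advantage, but my tests suggest otherwise.

This example: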

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10000000,9),columns=list('ABCDEFGHI'))
%time df.to_hdf("temp.h5", "temp", format ="fixed", mode="w")
%time df.to_hdf("temp2.h5", "temp2", format="table", mode="w")

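shows that the fixed format is slightly faster (6.8 s vs 5.9 s on my machine).

Then reading the data back (after a short pause to make sure the file has been fully saved):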

%time x = pd.read_hdf("temp.h5", "temp")
%time y = pd.read_hdf("temp2.h5", "temp2")
%time z = pd.read_hdf("temp2.h5", "temp2", columns=list("ABC"))

yields:

Wall time: 420 ms (fixed)
Wall time: 557 ms (table)
Wall time: 671 ms (table, specified columns)

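I understand that the fixed format is faster to read, but why is reading the DataFrame with specified columns slower than reading the full DataFrame? What is the benefit of using the table format (with or without specified columns) over the fixed format?

Is there perhaps a memory advantage as the DataFrame grows even bigger?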

Answer 1

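IMO the main advantage of using format='table' together with data_columns=[list_of_indexed_columns] is the ability to read huge HDF5 files conditionally (see the where="where clause" parameter). This lets you filter the data while reading and process it in chunks to avoid a MemoryError.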
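A minimal sketch of that conditional, chunked read, assuming PyTables is installed. The helper names (write_indexed_table, iter_filtered_chunks), the small demo frame, and the temporary path are made up for illustration and are not part of the original tests:

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_indexed_table(df, path, key, indexed_columns):
    # data_columns makes the listed columns queryable in a where clause.
    df.to_hdf(path, key=key, mode="w", format="table", data_columns=indexed_columns)


def iter_filtered_chunks(path, key, where, chunksize):
    # Yield only rows matching `where`, one chunk at a time, so the whole
    # table never has to be held in memory at once.
    with pd.HDFStore(path, mode="r") as store:
        for chunk in store.select(key, where=where, chunksize=chunksize):
            yield chunk


# Small hypothetical demo frame (the real use case is a multi-GB file).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((1000, 3)), columns=list("ABC"))
demo_path = os.path.join(tempfile.mkdtemp(), "demo_table.h5")

write_indexed_table(demo, demo_path, "demo", ["A"])
positive_rows = sum(
    len(chunk) for chunk in iter_filtered_chunks(demo_path, "demo", "A > 0", 250)
)
```

Only columns listed in data_columns can appear in the where clause, so it probably makes sense to index just the few columns you actually filter on.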

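You can try saving single columns or column groups (those that will be read together most of the time) in different HDF files, or in the same file under different keys.

I would also consider using a "cutting-edge" technology: the Feather format.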
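One way the "same file, different keys" variant might look, again assuming PyTables is available. The grouping of columns, the key names, and the helper names are hypothetical:

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_column_groups(df, path, groups):
    # groups maps an HDF key to the list of columns stored under that key.
    with pd.HDFStore(path, mode="w") as store:
        for key, columns in groups.items():
            store.put(key, df[columns], format="fixed")


def read_column_group(path, key):
    # Reads only the columns saved under this key; other groups stay on disk.
    return pd.read_hdf(path, key=key)


rng = np.random.default_rng(1)
frame = pd.DataFrame(rng.standard_normal((500, 4)), columns=list("ABCD"))
groups_path = os.path.join(tempfile.mkdtemp(), "groups.h5")

# Hypothetical grouping: A and B are usually read together, as are C and D.
write_column_groups(frame, groups_path, {"ab": ["A", "B"], "cd": ["C", "D"]})
ab = read_column_group(groups_path, "ab")
```

Each group keeps the fast fixed format, at the cost of deciding the grouping up front.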

Tests and timings

import feather

Writing to disk in three formats (HDF5 fixed, HDF5 table, Feather):

df = pd.DataFrame(np.random.randn(10000000,9),columns=list('ABCDEFGHI'))
df.to_hdf('c:/temp/fixed.h5', 'temp', format='f', mode='w')
df.to_hdf('c:/temp/tab.h5', 'temp', format='t', mode='w')
feather.write_dataframe(df, 'c:/temp/df.feather')

Reading from disk:

In [122]: %timeit pd.read_hdf(r'C:\Temp\fixed.h5', "temp")
1 loop, best of 3: 409 ms per loop

In [123]: %timeit pd.read_hdf(r'C:\Temp\tab.h5', "temp")
1 loop, best of 3: 558 ms per loop

In [124]: %timeit pd.read_hdf(r'C:\Temp\tab.h5', "temp", columns=list('BDF'))
The slowest run took 4.60 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 689 ms per loop

In [125]: %timeit feather.read_dataframe('c:/temp/df.feather')
The slowest run took 6.92 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 644 ms per loop

In [126]: %timeit feather.read_dataframe('c:/temp/df.feather', columns=list('BDF'))
1 loop, best of 3: 218 ms per loop   # WINNER !!!

PS: if you encounter the following error when using feather.write_dataframe(...):

FeatherError: Invalid: no support for strided data yet

here is a workaround:

df = df.copy()

After that, feather.write_dataframe(df, path) should work properly.
