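What am I trying to do?
pd.read_csv(..., nrows=###) can read the top nrows of a file. I'd like to do the same when using pd.read_hdf(...).
What is the problem?
I am confused by the documentation. start and stop look like what I need, but when I try them a ValueError is raised. The second thing I tried was nrows=10, thinking it might be accepted through **kwargs. When I do that, no error is thrown, but the full dataset is returned instead of just 10 rows.
Question: how do I correctly read a smaller subset of rows from an HDF file? (Edit: without having to read the whole thing into memory first!)
Below is my interactive session: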
>>> import pandas as pd
>>> df = pd.read_hdf('storage.h5')
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    df = pd.read_hdf('storage.h5')
  File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 367, in read_hdf
    raise ValueError('key must be provided when HDF5 file '
ValueError: key must be provided when HDF5 file contains multiple datasets.
>>> import h5py
>>> f = h5py.File('storage.h5', mode='r')
>>> list(f.keys())[0]
'table'
>>> f.close()
>>> df = pd.read_hdf('storage.h5', key='table', start=0, stop=10)
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    df = pd.read_hdf('storage.h5', key='table', start=0, stop=10)
  File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 370, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 740, in select
    return it.get_result()
  File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 1447, in get_result
    results = self.func(self.start, self.stop, where)
  File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 733, in func
    columns=columns, **kwargs)
  File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 2890, in read
    return self.obj_type(BlockManager(blocks, axes))
  File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 2795, in __init__
    self._verify_integrity()
  File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 3006, in _verify_integrity
    construction_error(tot_items, block.shape[1:], self.axes)
  File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 4280, in construction_error
    passed, implied))
ValueError: Shape of passed values is (614, 593430), indices imply (614, 10)
>>> df = pd.read_hdf('storage.h5', key='table', nrows=10)
>>> df.shape
(593430, 614)
Edit:
I just tried using where:
mylist = list(range(30))
df = pd.read_hdf('storage.h5', key='table', where='index=mylist')
I received a TypeError indicating that this is a Fixed format store (the default format value of df.to_hdf(...)):
TypeError: cannot pass a where specification when reading from a
Fixed format store. this store must be selected in its entirety
Does this mean I can't select a subset of rows if the store uses the Fixed format?
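The TypeError above says a where clause is rejected for a Fixed format store. One common way around that, sketched here under assumptions and not taken from the original post, is to write the data with format='table', which stores a queryable table. The demo frame, the temporary file name and the key below are hypothetical, and PyTables must be installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo data standing in for the real dataset.
df = pd.DataFrame(np.arange(300).reshape(100, 3), columns=["a", "b", "c"])

# Hypothetical file location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "storage_table.h5")

# format="table" writes a queryable table instead of the default Fixed layout.
df.to_hdf(path, key="table", format="table")

# Select rows by position: only rows 0..9 are requested.
head = pd.read_hdf(path, key="table", start=0, stop=10)

# Select rows with a where expression on the index.
subset = pd.read_hdf(path, key="table", where="index < 30")

print(head.shape, subset.shape)
```

The table format is generally slower to write and read than Fixed, so this is a trade-off: it pays off when partial reads matter more than raw throughput.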
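Answer 0 (score: 0)
I ran into the same problem. I'm now fairly sure that https://github.com/pandas-dev/pandas/issues/11188 tracks this exact issue. It is a ticket from 2015 and it contains a repro. Jeff Reback suggested that this is actually a bug, and he even pointed to a solution back in 2015. It's just that nobody has built that solution yet. I might give it a try.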
Answer 1 (score: 0)
This now seems to work, at least in pandas 1.0.1. Just supply the start and stop arguments:
df = pd.read_hdf('test.h5', '/floats/trajectories', start=0, stop=5)
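To mimic the nrows behaviour of pd.read_csv, the start/stop call can be wrapped in a small helper. This is a minimal sketch: read_hdf_head is a hypothetical name, not a pandas function, and the demo file and key are made up. It assumes a pandas version in which start and stop are honoured for the default Fixed format, as this answer reports for 1.0.1.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def read_hdf_head(path, key, nrows):
    """Return the first `nrows` rows stored under `key` (hypothetical helper)."""
    # start/stop are positional row bounds; stop is exclusive.
    return pd.read_hdf(path, key=key, start=0, stop=nrows)


# Hypothetical demo data written with the default (Fixed) format.
df = pd.DataFrame(np.arange(60).reshape(20, 3), columns=["a", "b", "c"])
path = os.path.join(tempfile.mkdtemp(), "test.h5")
df.to_hdf(path, key="demo")

top = read_hdf_head(path, "demo", 5)
print(top)
```

If this raises the shape-mismatch ValueError shown in the question, the installed pandas likely predates the fix, and writing the store with format='table' is the fallback.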