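I am trying to return a slice of an in-kernel PyTables query without first returning the whole result and then taking [-1], because the query result is very large. As an example, I have a table of the form
import tables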
class Tick(tables.IsDescription):
    timestamp = tables.Int64Col()
    bid = tables.Float64Col()
    ask = tables.Float64Col()

h5file = tables.openFile('test.h5','w')
tbl = h5file.createTable('/', 'ticks', Tick)
rows = [(123, 1.34, 1.35),(127, 1.345, 1.355),(128, 1.35, 1.36)]
tick = tbl.row
for row in rows:
    tick['bid'] = row[1]
    tick['ask'] = row[2]
    tick['timestamp'] = row[0]
    tick.append()
tbl.flush()
h5file.close()
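I would like to do something of the form
tbl.readWhere('tail 1 (timestamp <= 127)')
that has the same effect as
tbl.readWhere('(timestamp <= 127)')[-1]
but more efficiently. I have looked at the start/stop parameters, but they slice before the condition is applied, whereas I need the slice applied after the condition.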
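For illustration only: if the timestamp column happens to be stored in ascending order (an assumption the question does not state), the "tail 1" request reduces to a binary search for the last position whose value is at or below the limit. The helper name `last_index_at_or_before` is hypothetical, and a plain list stands in for the table column.

```python
import bisect

def last_index_at_or_before(timestamps, limit):
    """Return the position of the last element <= limit, or None if there is none.

    Assumes `timestamps` is sorted ascending and supports len() and integer
    indexing; only O(log n) elements are inspected.
    """
    pos = bisect.bisect_right(timestamps, limit)
    return pos - 1 if pos > 0 else None

# In-memory stand-in for the timestamp column of the example table.
timestamps = [123, 127, 128]
idx = last_index_at_or_before(timestamps, 127)
print(idx)  # prints 1
```

With an open table, `tbl.cols.timestamp` could probably play the role of `timestamps`, and `tbl[idx]` would then fetch the single matching row. This does not help if the rows are unsorted.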
The exact data has the format
09/05/14 20:59:41,1.37580,1.37620
09/05/14 20:59:43,437584,1.37624
09/05/14 20:59:45,1.37580,1.37620
09/05/14 20:59:45,1.37578,1.37622
09/05/14 20:59:45,1.37574,1.37624
09/05/14 20:59:58,1.37574,1.37624
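A minimal sketch of how lines in this format might be turned into `(timestamp, bid, ask)` tuples suitable for the `Tick` table. The day/month/year ordering and the UTC interpretation are assumptions, since the sample does not disambiguate them, and `parse_tick` is a hypothetical helper name.

```python
import calendar
from datetime import datetime

# Assumption: day/month/two-digit-year, with times taken as UTC.
TICK_TIME_FORMAT = "%d/%m/%y %H:%M:%S"

def parse_tick(line):
    """Split one 'timestamp,bid,ask' line into (epoch_seconds, bid, ask)."""
    stamp, bid, ask = line.strip().split(",")
    when = datetime.strptime(stamp, TICK_TIME_FORMAT)
    # timegm treats the time tuple as UTC and returns integer seconds,
    # which fits an Int64Col timestamp.
    return calendar.timegm(when.timetuple()), float(bid), float(ask)

rows = [parse_tick(s) for s in (
    "09/05/14 20:59:41,1.37580,1.37620",
    "09/05/14 20:59:45,1.37578,1.37622",
)]
```

Whole-second timestamps cannot distinguish the several ticks that share 20:59:45, so ordering within a second would have to rely on row position.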
Answer 0 (score: 1)
Using PyTables 3.0.0 (and pandas 0.14.1, which interfaces with PyTables at a higher level; the resulting table can be accessed from either).
In [1]: pd.set_option('max_rows',10)
In [2]: N = 100000000
In [3]: df = DataFrame(dict(A = np.random.randn(N), B = np.random.randn(N)), index=date_range('20130101',freq='ms',periods=N))
In [4]: df
Out[4]:
A B
2013-01-01 00:00:00 -1.184339 -0.362050
2013-01-01 00:00:00.001000 -0.431403 -0.602782
2013-01-01 00:00:00.002000 0.582003 1.207553
2013-01-01 00:00:00.003000 0.208940 -0.507944
2013-01-01 00:00:00.004000 -1.402088 -0.502517
... ... ...
2013-01-02 03:46:39.995000 1.815447 -0.050623
2013-01-02 03:46:39.996000 0.071673 1.138665
2013-01-02 03:46:39.997000 -0.778820 -0.280813
2013-01-02 03:46:39.998000 0.920727 0.570497
2013-01-02 03:46:39.999000 -1.205459 0.437231
[100000000 rows x 2 columns]
In [5]: df.to_hdf('test.hdf','df', mode='w',format='table',complib='blosc')
In [6]: pd.read_hdf('test.hdf','df',where='(index>"2013-01-02 01:00:00") and (index<"2013-01-02 01:00:01")')
Out[6]:
A B
2013-01-02 01:00:00.001000 -0.210051 -0.866118
2013-01-02 01:00:00.002000 -1.164465 0.388854
2013-01-02 01:00:00.003000 1.110326 0.925144
2013-01-02 01:00:00.004000 0.565132 -0.291035
2013-01-02 01:00:00.005000 -1.026886 0.047159
... ... ...
2013-01-02 01:00:00.995000 0.280094 -1.080868
2013-01-02 01:00:00.996000 -1.394722 -0.523851
2013-01-02 01:00:00.997000 0.072997 -0.643343
2013-01-02 01:00:00.998000 0.721472 0.447951
2013-01-02 01:00:00.999000 -0.838169 -0.794621
[999 rows x 2 columns]
In [8]: %timeit pd.read_hdf('test.hdf','df',where='(index>"2013-01-02 01:00:00") and (index<"2013-01-02 01:00:01")').iloc[-1]
10 loops, best of 3: 31.6 ms per loop
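My suggestion is to select a range that is typically larger than what you need. If it contains no values, select a larger one (as you said, your time series may be slightly irregular). In any case, the index is the key (pandas automatically creates the required index on 'index'). Selection then becomes very efficient. Queries can be specified in a fairly high-level way, see here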
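The widening-range idea can be sketched as below. This is only an illustration under stated assumptions: `last_in_window` and `select_ticks` are hypothetical names, the doubling policy is one arbitrary choice, and an in-memory list built from the question's three ticks stands in for the HDF5 query.

```python
def last_in_window(select, upper, window, max_window):
    """Return the last row with key in (upper - w, upper], widening w as needed.

    `select(lo, hi)` must return the rows whose key satisfies lo < key <= hi,
    in ascending key order. Returns None if nothing is found within max_window.
    """
    while True:
        rows = select(upper - window, upper)
        if len(rows):
            return rows[-1]
        if window >= max_window:
            return None
        # Nothing in this window: double it, capped at max_window.
        window = min(window * 2, max_window)

# In-memory stand-in for the indexed query.
TICKS = [(123, 1.34, 1.35), (127, 1.345, 1.355), (128, 1.35, 1.36)]

def select_ticks(lo, hi):
    return [t for t in TICKS if lo < t[0] <= hi]

print(last_in_window(select_ticks, upper=127, window=1, max_window=64))
```

With pandas, `select` could wrap a `pd.read_hdf(..., where=...)` call like the one timed above, using `.iloc[-1]` in place of `rows[-1]`. Each attempt then reads only a small indexed range rather than every row up to the limit.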
For reference, here is the generated PyTables metadata
In [5]: !ptdump -av test.hdf
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := [],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {'freq': <Milli>}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['A', 'B'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['values_block_0']]
/df/table (Table(100000000,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
byteorder := 'little'
chunkshape := (21845,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df/table._v_attrs (AttributeSet), 11 attributes:
[CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'values_block_0',
NROWS := 100000000,
TITLE := '',
VERSION := '2.7',
index_kind := 'datetime64',
values_block_0_dtype := 'float64',
values_block_0_kind := ['A', 'B']]