Return a slice of an in-kernel query

Time: 2014-08-06 17:20:02

Tags: python pytables

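I am trying to return a slice of an in-kernel PyTables query without first returning the whole range and then taking [-1], because the query result is very large.

As an example, I have data of the form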
import tables

class Tick(tables.IsDescription):
    timestamp = tables.Int64Col()
    bid = tables.Float64Col()
    ask = tables.Float64Col()

h5file = tables.openFile('test.h5','w')
tbl = h5file.createTable('/', 'ticks', Tick)

rows = [(123, 1.34, 1.35),(127, 1.345, 1.355),(128, 1.35, 1.36)]
tick = tbl.row
for row in rows:
    tick['bid'] = row[1]
    tick['ask'] = row[2]
    tick['timestamp'] = row[0]
    tick.append()
tbl.flush()
h5file.close()

I would like to do something of the form

tbl.readWhere('tail 1 (timestamp <= 127)')

which would have the same effect as
tbl.readWhere('(timestamp <= 127)')[-1]

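but more efficiently. I have looked at the start/stop parameters, but they slice the table before the condition is applied, whereas I need the slice to be applied after the condition.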

The actual data has the format

09/05/14 20:59:41,1.37580,1.37620
09/05/14 20:59:43,437584,1.37624
09/05/14 20:59:45,1.37580,1.37620
09/05/14 20:59:45,1.37578,1.37622
09/05/14 20:59:45,1.37574,1.37624
09/05/14 20:59:58,1.37574,1.37624

1 answer:

Answer 0 (score: 1)

This uses PyTables 3.0.0 (and pandas 0.14.1, which interfaces with PyTables at a higher level; the resulting table can be accessed from either).

In [1]: pd.set_option('max_rows',10)

In [2]: N = 100000000

In [3]: df = DataFrame(dict(A = np.random.randn(N), B = np.random.randn(N)), index=date_range('20130101',freq='ms',periods=N))

In [4]: df
Out[4]: 
                                   A         B
2013-01-01 00:00:00        -1.184339 -0.362050
2013-01-01 00:00:00.001000 -0.431403 -0.602782
2013-01-01 00:00:00.002000  0.582003  1.207553
2013-01-01 00:00:00.003000  0.208940 -0.507944
2013-01-01 00:00:00.004000 -1.402088 -0.502517
...                              ...       ...
2013-01-02 03:46:39.995000  1.815447 -0.050623
2013-01-02 03:46:39.996000  0.071673  1.138665
2013-01-02 03:46:39.997000 -0.778820 -0.280813
2013-01-02 03:46:39.998000  0.920727  0.570497
2013-01-02 03:46:39.999000 -1.205459  0.437231

[100000000 rows x 2 columns]

In [5]: df.to_hdf('test.hdf','df', mode='w',format='table',compress='blosc')

In [6]: pd.read_hdf('test.hdf','df',where='(index>"2013-01-02 01:00:00") and (index<"2013-01-02 01:00:01")')
Out[6]: 
                                   A         B
2013-01-02 01:00:00.001000 -0.210051 -0.866118
2013-01-02 01:00:00.002000 -1.164465  0.388854
2013-01-02 01:00:00.003000  1.110326  0.925144
2013-01-02 01:00:00.004000  0.565132 -0.291035
2013-01-02 01:00:00.005000 -1.026886  0.047159
...                              ...       ...
2013-01-02 01:00:00.995000  0.280094 -1.080868
2013-01-02 01:00:00.996000 -1.394722 -0.523851
2013-01-02 01:00:00.997000  0.072997 -0.643343
2013-01-02 01:00:00.998000  0.721472  0.447951
2013-01-02 01:00:00.999000 -0.838169 -0.794621

[999 rows x 2 columns]

In [8]: %timeit pd.read_hdf('test.hdf','df',where='(index>"2013-01-02 01:00:00") and (index<"2013-01-02 01:00:01")').iloc[-1]
10 loops, best of 3: 31.6 ms per loop

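My suggestion is to select a typical range that is larger than you need. If it contains no values, select a larger one (as you said, your time series may be slightly irregular). In any case, the index is the key (pandas automatically creates the required index on 'index'), and it makes the selection very efficient. Queries can be specified in a fairly high-level way; see here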
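
The widening-range strategy suggested above can be sketched as follows. This is only a minimal illustration, not the answer's own code: `last_match`, `query`, and `query_ticks` are hypothetical names, and the in-memory `query_ticks` merely stands in for an indexed range query against the HDF5 file.

```python
from bisect import bisect_right


def last_match(query, upper, window, max_window):
    """Return the last row with key <= upper, or None if nothing is found.

    query(lo, hi) is assumed to return, in key order, the sequence of
    rows with lo < key <= hi. The window is doubled until the query
    returns something or max_window has been tried.
    """
    while True:
        rows = query(upper - window, upper)
        if len(rows) > 0:
            return rows[-1]
        if window >= max_window:
            return None
        window = min(window * 2, max_window)


# Sample ticks from the question: (timestamp, bid, ask), sorted by timestamp.
ticks = [(123, 1.34, 1.35), (127, 1.345, 1.355), (128, 1.35, 1.36)]
timestamps = [t[0] for t in ticks]


def query_ticks(lo, hi):
    # In-memory stand-in for an indexed range query on the timestamp.
    return ticks[bisect_right(timestamps, lo):bisect_right(timestamps, hi)]


print(last_match(query_ticks, upper=127, window=1, max_window=64))
```

With pandas, `query` could instead wrap a `pd.read_hdf` call whose `where` clause bounds the index on both sides, as in the session above; the last row would then be taken with `.iloc[-1]` rather than `[-1]`.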

For reference, here is the generated PyTables metadata

In [5]: !ptdump -av test.hdf
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := [],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {'freq': <Milli>}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['A', 'B'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['values_block_0']]
/df/table (Table(100000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (21845,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df/table._v_attrs (AttributeSet), 11 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    NROWS := 100000000,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'datetime64',
    values_block_0_dtype := 'float64',
    values_block_0_kind := ['A', 'B']]
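
The `colindexes` entry in the dump above is the index that pandas created. For a table built directly with PyTables, as in the question, an index has to be created explicitly. The following is a rough sketch under some assumptions: it uses the snake_case API of PyTables 3.x, writes to a temporary file, and the lower bound of 120 is an arbitrary window chosen for illustration. It picks the largest timestamp explicitly rather than relying on the order in which the query returns rows.

```python
import os
import tempfile

import tables


class Tick(tables.IsDescription):
    # pos fixes the column order so tuples can be appended positionally.
    timestamp = tables.Int64Col(pos=0)
    bid = tables.Float64Col(pos=1)
    ask = tables.Float64Col(pos=2)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'ticks.h5')
    with tables.open_file(path, 'w') as h5file:
        tbl = h5file.create_table('/', 'ticks', Tick)
        tbl.append([(123, 1.34, 1.35), (127, 1.345, 1.355), (128, 1.35, 1.36)])
        tbl.flush()

        # Index the column used in the condition.
        tbl.cols.timestamp.create_index()

        # Bound the query on both sides so that only a small window is read.
        window = tbl.read_where('(timestamp > 120) & (timestamp <= 127)')

        # Take the row with the largest timestamp inside the window.
        last = window[window['timestamp'].argmax()]

print(last)
```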