Question

I'm fairly new to the pandas package. I'm backtesting some ETF strategies, and I need to run a lot of queries against pandas DataFrames.

Say I have these two DataFrames, df and df1. The only difference is that df has a datetime index, while df1 has the timestamps as a column and an integer index.

In[104]: df.head()
Out[104]: 

                       high     low    open   close   volume  openInterest
2007-04-24 09:31:00  148.28  148.12  148.23  148.15  2304400        341400
2007-04-24 09:32:00  148.21  148.14  148.14  148.19  2753500        449100
2007-04-24 09:33:00  148.24  148.13  148.18  148.14  2863400        109900
2007-04-24 09:34:00  148.18  148.12  148.13  148.16  3118287        254887
2007-04-24 09:35:00  148.17  148.14  148.16  148.16  3202112         83825

In[105]: df1.head()
Out[105]: 

                dates    high     low    open   close   volume  openInterest
0 2007-04-24 09:31:00  148.28  148.12  148.23  148.15  2304400        341400
1 2007-04-24 09:32:00  148.21  148.14  148.14  148.19  2753500        449100
2 2007-04-24 09:33:00  148.24  148.13  148.18  148.14  2863400        109900
3 2007-04-24 09:34:00  148.18  148.12  148.13  148.16  3118287        254887
4 2007-04-24 09:35:00  148.17  148.14  148.16  148.16  3202112         83825
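For anyone who wants to reproduce the setup, here is a minimal sketch of how two frames of this shape could be built. The values are random placeholders (not the real data from the question), and the date range and row count are assumptions picked only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical one-minute bars: random numbers, not real market data.
rng = np.random.default_rng(0)
idx = pd.date_range("2015-11-16 09:31", periods=1000, freq="min")
n = len(idx)

# df: the timestamps are the index (a DatetimeIndex).
df = pd.DataFrame(
    {
        "high": rng.uniform(148.0, 149.0, n),
        "low": rng.uniform(147.0, 148.0, n),
        "open": rng.uniform(147.5, 148.5, n),
        "close": rng.uniform(147.5, 148.5, n),
        "volume": rng.integers(1_000_000, 4_000_000, n),
        "openInterest": rng.integers(50_000, 500_000, n),
    },
    index=idx,
)

# df1: same data, but the timestamps become a "dates" column
# and the index is replaced by the default 0..n-1 integers.
df1 = df.rename_axis("dates").reset_index()

print(df.head())
print(df1.head())
```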

So I ran a quick test of the query speed:

In[100]: %timeit df1[(df1['dates'] >= '2015-11-17') & (df1['dates'] < '2015-11-18')]
%timeit df.loc[(df.index >= '2015-11-17') & (df.index < '2015-11-18')]
%timeit df.loc['2015-11-17']
100 loops, best of 3: 4.67 ms per loop
100 loops, best of 3: 3.14 ms per loop
1 loop, best of 3: 259 ms per loop
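The three expressions timed above are meant to select the same rows, namely everything on 2015-11-17. Below is a small sketch that checks this on synthetic data. The frame contents and the three-day date range are assumptions for illustration, not the data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical minute-level data covering 2015-11-16 through 2015-11-18.
idx = pd.date_range("2015-11-16 00:00", periods=3 * 24 * 60, freq="min")
df = pd.DataFrame({"close": np.arange(len(idx), dtype=float)}, index=idx)
df1 = df.rename_axis("dates").reset_index()

# 1) Boolean mask on a datetime column.
by_column = df1[(df1["dates"] >= "2015-11-17") & (df1["dates"] < "2015-11-18")]

# 2) Boolean mask on the DatetimeIndex.
by_index_mask = df.loc[(df.index >= "2015-11-17") & (df.index < "2015-11-18")]

# 3) Partial string indexing on the DatetimeIndex.
by_partial_string = df.loc["2015-11-17"]

# All three should contain the same rows.
print(len(by_column), len(by_index_mask), len(by_partial_string))
```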

To my surprise, the built-in pandas lookup is actually the slowest:

df.loc['2015-11-17']

Does anyone know why this is? Is there any documentation or blog post on the most efficient way to query a pandas DataFrame?

Answer 1

If I were you, I would use the simpler form:

df['2015-11-17']

In my view, this is more "pandas logic" than using .loc[] for a single date. I would guess it is faster, too.

Tested on a one-minute OHLC DataFrame:

%timeit df.loc[(df.index >= '2015-11-17') & (df.index < '2015-11-18')]
%timeit df.loc['2015-11-17']
%timeit df['2015-11-17']

100 loops, best of 3: 13.8 ms per loop
1 loop, best of 3: 1.39 s per loop
1000 loops, best of 3: 486 us per loop
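To rerun this comparison outside IPython, something like the sketch below could be used with the standard timeit module. The synthetic frame and the repeat counts are assumptions, and absolute numbers will differ by machine and pandas version. As far as I know, selecting rows with df['2015-11-17'] was deprecated in later pandas releases and removed in pandas 2.0, so the sketch only times that form when it still works.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical one-minute frame covering 30 days (not the data from the answer).
idx = pd.date_range("2015-11-01 00:00", periods=30 * 24 * 60, freq="min")
df = pd.DataFrame({"close": np.arange(len(idx), dtype=float)}, index=idx)

candidates = {
    "boolean mask": lambda: df.loc[
        (df.index >= "2015-11-17") & (df.index < "2015-11-18")
    ],
    ".loc[date string]": lambda: df.loc["2015-11-17"],
}

# Row selection via df[date string] depends on the pandas version:
# where it is no longer supported, the string is treated as a column
# label and raises KeyError, so that candidate is skipped.
try:
    df["2015-11-17"]
except KeyError:
    pass
else:
    candidates["df[date string]"] = lambda: df["2015-11-17"]

timings = {}
for name, fn in candidates.items():
    # Best of 3 runs, 20 calls per run, reported per call.
    timings[name] = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    print(f"{name}: {timings[name] * 1e3:.3f} ms per call")
```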

Querying a Python Pandas DataFrame by Datetime index or column
