How to quickly select rows between dates in a pandas DataFrame

Time: 2018-01-26 13:17:28

Tags: python-2.7 pandas

I would like to know the most efficient way, in terms of speed, to select the rows whose index falls between two dates. For example:

>>> import numpy as np
>>> import pandas as pd
>>> index = pd.date_range('2018-01-01', '2030-01-02', freq='BM')
>>> df = pd.DataFrame(np.zeros((len(index), 1)), index=index)
>>> df.head()
              0
2018-01-31  0.0
2018-02-28  0.0
2018-03-30  0.0
2018-04-30  0.0
2018-05-31  0.0

One way to select all rows between, for example, 2018-05-30 and 2027-07-03 is:

>>> df.loc[(df.index >= '2018-05-30') & (df.index <= '2027-07-03')]

In my application, I do not know the values 2018-05-30 and 2027-07-03 beforehand. What is the fastest way to make the required selection?

2 answers:

Answer 0: (score: 1)

You can use truncate (shown first, followed by the label slice and the boolean mask for comparison):

print (df.truncate(before='2018-05-30', after='2027-07-03'))

print (df.loc['2018-05-30':'2027-07-03'])

print (df.loc[(df.index >= '2018-05-30') & (df.index <= '2027-07-03')])
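As a quick sanity check before comparing speed, the sketch below rebuilds the frame from the question and confirms that the three selections return the same rows. It assumes a sorted DatetimeIndex. The index is built with pd.offsets.BMonthEnd() instead of the 'BM' alias from the question, because the alias name differs between pandas versions. The names start, end, by_truncate, by_slice and by_mask are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Same business-month-end index as in the question, built from an offset object.
index = pd.date_range('2018-01-01', '2030-01-02', freq=pd.offsets.BMonthEnd())
df = pd.DataFrame(np.zeros((len(index), 1)), index=index)

# Bounds that do not have to be present in the index.
start, end = pd.Timestamp('2018-05-30'), pd.Timestamp('2027-07-03')

by_truncate = df.truncate(before=start, after=end)          # inclusive on both ends
by_slice = df.loc[start:end]                                # label slice, inclusive on both ends
by_mask = df.loc[(df.index >= start) & (df.index <= end)]   # boolean mask

# All three should describe the same rows.
print(by_truncate.equals(by_slice) and by_slice.equals(by_mask))
print(by_mask.index[0], by_mask.index[-1])
```

Note that truncate and the label slice rely on the index being sorted, while the boolean mask works on an unsorted index too.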

Timings:

In [366]: %timeit (df.loc['2018-05-30':'2027-07-03'])
The slowest run took 5.08 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.43 ms per loop

In [367]: %timeit (df.loc[(df.index >= '2018-05-30') & (df.index <= '2027-07-03')])
The slowest run took 4.97 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 502 µs per loop

In [368]: %timeit (df.truncate(before='2018-05-30', after='2027-07-03'))
The slowest run took 4.98 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 450 µs per loop

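If you change the condition so that the last value, if present, is excluded (<= becomes <), you can also slice by position with searchsorted: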

In [372]: %timeit (df.loc[(df.index >= '2018-05-31') & (df.index < '2027-05-31')])
The slowest run took 4.81 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 520 µs per loop

In [373]: %timeit (df.iloc[df.index.searchsorted('2018-05-31'): df.index.searchsorted('2027-05-31')])
10000 loops, best of 3: 136 µs per loop
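The searchsorted call above uses the default side='left' for both bounds, so the end date is excluded. If the inclusive behaviour of the original <= condition is wanted, one option is to pass side='right' for the upper bound. The following is only a sketch under the assumption that the index is a sorted DatetimeIndex; select_between is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd


def select_between(frame, start, end):
    """Return the rows with start <= index <= end using binary search.

    Hypothetical helper for illustration; requires a sorted index.
    """
    if not frame.index.is_monotonic_increasing:
        raise ValueError('searchsorted needs a sorted index')
    # side='left': position of the first label >= start.
    lo = frame.index.searchsorted(pd.Timestamp(start), side='left')
    # side='right': position just past the last label <= end,
    # so a label equal to `end` is kept.
    hi = frame.index.searchsorted(pd.Timestamp(end), side='right')
    return frame.iloc[lo:hi]


# Same business-month-end index as in the question.
index = pd.date_range('2018-01-01', '2030-01-02', freq=pd.offsets.BMonthEnd())
df = pd.DataFrame(np.zeros((len(index), 1)), index=index)

selected = select_between(df, '2018-05-30', '2027-07-03')
print(selected.index[0], selected.index[-1], len(selected))
```

Because both bounds are found by binary search and the result is a positional slice, no boolean array over the whole index has to be built.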

Answer 1: (score: 0)

Your original approach appears to be the faster of the two options:

Lookup using "&":

In[]: %timeit -r 5 -n 10 df.loc[(df.index >= '2018-05-30') & (df.index <= '2027-07-03')]
Out[]: 10 loops, best of 5: 501 µs per loop

Lookup using ":" slice notation:

In[]: %timeit -r 5 -n 10 df.loc['2018-05-30':'2027-07-03']
Out[]: 10 loops, best of 5: 724 µs per loop

So you are already using an optimized operation.

Edit: added another, slower operation to show that this is already fast:

In[]: %timeit -r 5 -n 10 df[df.index.isin(pd.date_range("2018-05-30", "2027-07-03").values)]
Out[]: 10 loops, best of 5: 1.13 ms per loop
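The timings quoted on this page come from different machines and pandas versions and do not fully agree with each other, so it may be worth re-measuring locally. The script below is a minimal sketch that runs the variants mentioned in both answers through the standard timeit module outside IPython. The candidates and results dictionaries and the repetition counts are arbitrary choices for this illustration, and the index is built with pd.offsets.BMonthEnd() rather than the 'BM' alias.

```python
import timeit

import numpy as np
import pandas as pd

# Same business-month-end index as in the question.
index = pd.date_range('2018-01-01', '2030-01-02', freq=pd.offsets.BMonthEnd())
df = pd.DataFrame(np.zeros((len(index), 1)), index=index)
start, end = pd.Timestamp('2018-05-30'), pd.Timestamp('2027-07-03')

# Each entry selects the rows with start <= index <= end.
candidates = {
    'loc slice': lambda: df.loc[start:end],
    'boolean mask': lambda: df.loc[(df.index >= start) & (df.index <= end)],
    'truncate': lambda: df.truncate(before=start, after=end),
    'searchsorted': lambda: df.iloc[
        df.index.searchsorted(start, side='left'):
        df.index.searchsorted(end, side='right')
    ],
    'isin': lambda: df[df.index.isin(pd.date_range(start, end).values)],
}

number = 100  # calls per timing run
results = {}
for name, func in candidates.items():
    # Best of three runs, converted to seconds per call.
    results[name] = min(timeit.repeat(func, number=number, repeat=3)) / number
    print('{:>13}: {:8.1f} us per call'.format(name, results[name] * 1e6))
```

On a frame this small (144 rows) the differences are mostly fixed overhead, so it is sensible to repeat the measurement on data of realistic size before picking a method.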