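Select rows with an Interval

Time: 2018-02-07 20:18:15

Tags: python pandas dataframe intervals

I have a wide Pandas DataFrame with time-indexed values, and I want to select rows from it using an Interval object I created: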

inter = pd.Interval(pd.Timestamp('2017-12-05 16:36:17'),
                    pd.Timestamp('2017-12-05 22:00:00'), closed='left')

I tried the loc and iloc methods, but they do not accept an Interval instance as an argument.

I can test whether a Timestamp is in that Interval:

pd.Timestamp('2017-12-05 22:00:00') in inter

But I could not write a one-liner that selects the rows of the DataFrame.

3 answers:

Answer 0 (score: 2)

Setup

s = pd.Series(
      pd.date_range('2017-12-05 16:00:00', '2017-12-05 23:00:00', freq='H')
)
s

0   2017-12-05 16:00:00
1   2017-12-05 17:00:00
2   2017-12-05 18:00:00
3   2017-12-05 19:00:00
4   2017-12-05 20:00:00
5   2017-12-05 21:00:00
6   2017-12-05 22:00:00
7   2017-12-05 23:00:00
dtype: datetime64[ns]

Here is how to solve this for each of the 4 kinds of interval closure.

  1. closed='left'

    (inter.left <= s) & (s < inter.right)
    
    0    False
    1     True
    2     True
    3     True
    4     True
    5     True
    6    False
    7    False
    dtype: bool
    
  2. closed='right'

    (inter.left < s) & (s <= inter.right)
    
    0    False
    1     True
    2     True
    3     True
    4     True
    5     True
    6     True
    7    False
    dtype: bool
    
  3. closed='neither'

    (inter.left < s) & (s < inter.right)
    
    0    False
    1     True
    2     True
    3     True
    4     True
    5     True
    6    False
    7    False
    dtype: bool
    
  4. closed='both' (using pd.Series.between, which should be slightly more efficient; it includes both endpoints by default.)

    s.between(inter.left, inter.right)
    
    0    False
    1     True
    2     True
    3     True
    4     True
    5     True
    6     True
    7    False
    dtype: bool
    
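  5. Once the mask has been computed with any of these methods, selecting the rows is easy: s[mask], where mask is the boolean mask we just computed.

Answer 1 (score: 1)

Here is an example of mine. We can use loc, and I will walk you through it step by step: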
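
The four cases above can be folded into one small helper that reads the closure from the Interval itself. This is only a sketch: `interval_mask` is a hypothetical name, not a pandas function, and the toy DataFrame with a DatetimeIndex is an assumption meant to mirror the time-indexed frame in the question.

```python
import pandas as pd

def interval_mask(values, interval):
    """Hypothetical helper: boolean mask of the entries of `values`
    (a Series or Index) that fall inside `interval`."""
    # Pick >= or > (and <= or <) depending on which sides are closed.
    lower = (values >= interval.left) if interval.closed_left else (values > interval.left)
    upper = (values <= interval.right) if interval.closed_right else (values < interval.right)
    return lower & upper

inter = pd.Interval(pd.Timestamp('2017-12-05 16:36:17'),
                    pd.Timestamp('2017-12-05 22:00:00'), closed='left')

# Assumed sample data: a small frame indexed by hourly timestamps.
idx = pd.date_range('2017-12-05 16:00:00', periods=8, freq='60min')
df = pd.DataFrame({'a': range(8), 'b': range(8, 16)}, index=idx)

# Apply the mask to the index to select rows.
selected = df[interval_mask(df.index, inter)]
print(selected)
```

Because the helper only relies on comparison operators, the same call should work on a column (`df['Dates']`) as well as on the index.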

import pandas as pd
inter = pd.Interval(pd.Timestamp('2017-12-05 16:36:17'),
                    pd.Timestamp('2017-12-05 22:00:00'), closed='left')
# creating a dataframe of different dates ranging from 12/03 to 12/07
df3 = pd.DataFrame({'Dates':pd.date_range(pd.Timestamp('2017-12-03 16:36:17'), 
      pd.Timestamp('2017-12-07 22:00:00'), freq='H')})

# creating a column to see if the data is in between the interval you created.
df3['In?'] = df3['Dates'].apply(lambda x: x in inter)

#filtering that dataframe 
df3.loc[df3['In?'] ==True]

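You could skip creating the In? column and filter right away, but I wanted you to see the steps.

df3.loc[df3['Dates'].apply(lambda x: x in inter) == True] is how to do this with apply() directly, without creating the In? column.

Answer 2 (score: 1)

Borrowing the sample dataset from @MattR's answer: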

In [114]: df3.query("@inter.left <= Dates < @inter.right")
Out[114]:
                 Dates
48 2017-12-05 16:36:17
49 2017-12-05 17:36:17
50 2017-12-05 18:36:17
51 2017-12-05 19:36:17
52 2017-12-05 20:36:17
53 2017-12-05 21:36:17
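
Since the question mentions that the timestamps live in the index rather than in a column, the same query can be written against the index. A minimal sketch, assuming an unnamed DatetimeIndex (which `query` exposes under the name `index`) and a made-up toy frame; the closure still has to be spelled out by hand in the expression:

```python
import pandas as pd

inter = pd.Interval(pd.Timestamp('2017-12-05 16:36:17'),
                    pd.Timestamp('2017-12-05 22:00:00'), closed='left')

# Assumed sample data: timestamps are the index, not a column.
idx = pd.date_range('2017-12-05 16:00:00', periods=8, freq='60min')
df = pd.DataFrame({'a': range(8)}, index=idx)

# `index` refers to the frame's (unnamed) index inside the query string;
# <= on the left and < on the right matches closed='left'.
selected = df.query("@inter.left <= index < @inter.right")
print(selected)
```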

Timings for a DataFrame of about 100K rows:

In [109]: df = pd.concat([df3]*1000, ignore_index=True)

In [110]: df.shape
Out[110]: (102000, 1)

In [111]: %timeit df.query("@inter.left <= Dates < @inter.right")
9.1 ms ± 20.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [112]: %timeit df.loc[df['Dates'].apply(lambda x: x in inter) == True]
1.54 s ± 48.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [113]: %timeit df[df['Dates'].between(inter.left, inter.right)]
3.96 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
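
Note that the `between` call in the timing includes both endpoints, while the interval in the question is closed on the left only. A sketch of one way to keep the two in sync, assuming pandas 1.3 or later, where `inclusive` accepts the strings 'both', 'neither', 'left' and 'right' (the same values that `Interval.closed` returns):

```python
import pandas as pd

inter = pd.Interval(pd.Timestamp('2017-12-05 16:36:17'),
                    pd.Timestamp('2017-12-05 22:00:00'), closed='left')

# Same sample data as in the answer above, rebuilt so the block is self-contained.
df3 = pd.DataFrame({'Dates': pd.date_range('2017-12-03 16:36:17',
                                           '2017-12-07 22:00:00', freq='60min')})

# Pass the interval's own closure straight through to `between`
# (assumes pandas >= 1.3; older versions only accept a boolean here).
mask = df3['Dates'].between(inter.left, inter.right, inclusive=inter.closed)
result = df3[mask]
print(result)
```

This keeps the vectorized speed of `between` while honoring whichever closure the Interval was built with.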