I have a wide pandas DataFrame with time-indexed values, and I want to select rows using an Interval object I created:
inter = pd.Interval(pd.Timestamp('2017-12-05 16:36:17'),
pd.Timestamp('2017-12-05 22:00:00'), closed='left')
I tried the loc and iloc methods, but they do not accept an Interval instance as an argument.
I can test whether a single timestamp falls inside the Interval:
pd.Timestamp('2017-12-05 22:00:00') in inter
But I can't work out a one-liner that selects the matching rows of the DataFrame.
Answer 0: (score: 2)
Setup
s = pd.Series(
pd.date_range('2017-12-05 16:00:00', '2017-12-05 23:00:00', freq='H')
)
s
0 2017-12-05 16:00:00
1 2017-12-05 17:00:00
2 2017-12-05 18:00:00
3 2017-12-05 19:00:00
4 2017-12-05 20:00:00
5 2017-12-05 21:00:00
6 2017-12-05 22:00:00
7 2017-12-05 23:00:00
dtype: datetime64[ns]
Here is how to solve this for each of the four ways an interval can be closed.
closed='left'
(inter.left <= s) & (s < inter.right)
0 False
1 True
2 True
3 True
4 True
5 True
6 False
7 False
dtype: bool
closed='right'
(inter.left < s) & (s <= inter.right)
0 False
1 True
2 True
3 True
4 True
5 True
6 True
7 False
dtype: bool
closed='neither'
(inter.left < s) & (s < inter.right)
0 False
1 True
2 True
3 True
4 True
5 True
6 False
7 False
dtype: bool
closed='both'
(Using pd.Series.between, which should be slightly more efficient.)
s.between(inter.left, inter.right)  # both endpoints are included by default
0 False
1 True
2 True
3 True
4 True
5 True
6 True
7 False
dtype: bool
Once the mask has been computed with any of these methods, selecting the rows is simply s[mask], where mask is the boolean mask we just computed.
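For the time-indexed DataFrame in the question, the four cases above can be folded into one helper that reads the interval's own closed_left and closed_right attributes. This is only a minimal sketch: the name interval_mask, the value column, and the sample frame are made up for illustration and do not come from the original answer.

```python
import pandas as pd

def interval_mask(values, interval):
    """Return a boolean mask of the entries of `values` inside `interval`.

    `interval_mask` is a hypothetical helper, not a pandas function.
    """
    # Pick >= or > for the lower bound depending on how the interval is closed.
    lower = values >= interval.left if interval.closed_left else values > interval.left
    # Pick <= or < for the upper bound in the same way.
    upper = values <= interval.right if interval.closed_right else values < interval.right
    return lower & upper

# Illustrative frame: eight hourly rows indexed by timestamp.
idx = pd.date_range('2017-12-05 16:00:00', '2017-12-05 23:00:00', periods=8)
df = pd.DataFrame({'value': range(8)}, index=idx)

inter = pd.Interval(pd.Timestamp('2017-12-05 16:36:17'),
                    pd.Timestamp('2017-12-05 22:00:00'), closed='left')

# Apply the mask to the index to select rows.
selected = df[interval_mask(df.index, inter)]
print(selected)
```

The same helper should work on a datetime column (for example df['some_column']) as well as on the index, since both support element-wise comparison against a Timestamp.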
Answer 1: (score: 1)
Here is an example of mine. We can use loc, and I'll walk you through it step by step:
import pandas as pd
inter = pd.Interval(pd.Timestamp('2017-12-05 16:36:17'),
pd.Timestamp('2017-12-05 22:00:00'), closed='left')
# creating a dataframe of different dates ranging from 12/03 to 12/07
df3 = pd.DataFrame({'Dates':pd.date_range(pd.Timestamp('2017-12-03 16:36:17'),
pd.Timestamp('2017-12-07 22:00:00'), freq='H')})
# creating a column to see if the data is in between the interval you created.
df3['In?'] = df3['Dates'].apply(lambda x: x in inter)
#filtering that dataframe
df3.loc[df3['In?'] ==True]
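Now, you could skip creating the In? column and filter right away, but I wanted you to see the steps.
df3.loc[df3['Dates'].apply(lambda x: x in inter) == True]
is how to do this without creating the extra column, by running the in test through apply().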
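The apply-based filter tests one row at a time. As a rough sketch, the same rows can be selected with column-wise comparisons, written here for the left-closed interval used in this answer. The names by_apply and by_mask are illustrative, and the hourly frequency is passed as a Timedelta rather than the 'H' alias to avoid depending on a particular pandas version.

```python
import pandas as pd

inter = pd.Interval(pd.Timestamp('2017-12-05 16:36:17'),
                    pd.Timestamp('2017-12-05 22:00:00'), closed='left')

# Same sample data as in the answer, one row per hour.
df3 = pd.DataFrame({'Dates': pd.date_range(pd.Timestamp('2017-12-03 16:36:17'),
                                           pd.Timestamp('2017-12-07 22:00:00'),
                                           freq=pd.Timedelta(hours=1))})

# Row-by-row membership test, as in the answer.
by_apply = df3.loc[df3['Dates'].apply(lambda x: x in inter)]

# Column-wise comparisons: left endpoint included, right endpoint excluded.
mask = (df3['Dates'] >= inter.left) & (df3['Dates'] < inter.right)
by_mask = df3.loc[mask]

# Both approaches should pick the same rows.
print(by_apply.equals(by_mask))
```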
Answer 2: (score: 1)
Borrowing the sample dataset from @MattR's answer:
In [114]: df3.query("@inter.left <= Dates < @inter.right")
Out[114]:
Dates
48 2017-12-05 16:36:17
49 2017-12-05 17:36:17
50 2017-12-05 18:36:17
51 2017-12-05 19:36:17
52 2017-12-05 20:36:17
53 2017-12-05 21:36:17
Timings for a DataFrame of about 100K rows:
In [109]: df = pd.concat([df3]*1000, ignore_index=True)
In [110]: df.shape
Out[110]: (102000, 1)
In [111]: %timeit df.query("@inter.left <= Dates < @inter.right")
9.1 ms ± 20.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [112]: %timeit df.loc[df['Dates'].apply(lambda x: x in inter) == True]
1.54 s ± 48.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [113]: %timeit df[df['Dates'].between(inter.left, inter.right, inclusive=True)]
3.96 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
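Note that the between call timed above includes both endpoints, whereas inter is closed on the left only. The two agree on this sample only because no row falls exactly on 22:00:00. Assuming pandas 1.3 or later, where between accepts the strings 'both', 'neither', 'left' and 'right' for inclusive, the interval's own closed attribute can be passed straight through. The sketch below uses a small hourly series that does contain the right endpoint.

```python
import pandas as pd

inter = pd.Interval(pd.Timestamp('2017-12-05 16:36:17'),
                    pd.Timestamp('2017-12-05 22:00:00'), closed='left')

# Eight hourly timestamps from 16:00 to 23:00, including 22:00:00 exactly.
s = pd.Series(pd.date_range('2017-12-05 16:00:00', '2017-12-05 23:00:00', periods=8))

# inter.closed is one of 'left', 'right', 'both', 'neither',
# which are the same strings that between() accepts (pandas >= 1.3 assumed).
mask = s.between(inter.left, inter.right, inclusive=inter.closed)
selected = s[mask]
print(selected)
```

On older pandas versions, where inclusive only takes a boolean, the explicit comparisons from the first answer remain the portable option.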