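How can I use diff to select times that are close to each other but fall within an unknown range?

Time: 2018-12-07 01:22:13

Tags: python pandas

The following dataset holds GPS timestamps for buses arriving at a particular bus stop. While a bus idles at the stop, its GPS transmitter keeps sending data at semi-regular intervals.

I am trying to compile the departure time of each bus from this stop. The complicating factor is that the same bus may repeat its route, returning to the stop roughly every 2 hours.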

In the data frame below, if bus NYCT_1202 arrives at the stop in row 0 at 10:01:19 and stays there until row 1 at 10:11:48, I want to select the 10:11:48 row.

Similarly, when the same bus loops back to the stop about two hours later in row 2 at 12:51:31, it "idles" (perhaps out of service) until row 5 at 13:51:35. Again I want to select the last time.

df = pd.DataFrame({
    'RecordedAtTime': {
        0: Timestamp('2017-08-23 10:01:19'),
        1: Timestamp('2017-08-23 10:11:48'),
        2: Timestamp('2017-08-23 12:51:31'),
        3: Timestamp('2017-08-23 13:02:02'),
        4: Timestamp('2017-08-23 13:11:27'),
        5: Timestamp('2017-08-23 13:51:35'),
        6: Timestamp('2017-08-23 16:12:27'),
        7: Timestamp('2017-08-23 16:52:25'),
        8: Timestamp('2017-08-07 09:33:42'),
        9: Timestamp('2017-08-07 10:13:36')},
    'VehicleRef': {
        0: 'NYCT_1202', 1: 'NYCT_1202', 2: 'NYCT_1202', 3: 'NYCT_1202',
        4: 'NYCT_1202', 5: 'NYCT_1202', 6: 'NYCT_1202', 7: 'NYCT_1202',
        8: 'NYCT_1206', 9: 'NYCT_1206'}})

       RecordedAtTime VehicleRef
0 2017-08-23 10:01:19  NYCT_1202
1 2017-08-23 10:11:48  NYCT_1202  <-This Row
2 2017-08-23 12:51:31  NYCT_1202
3 2017-08-23 13:02:02  NYCT_1202
4 2017-08-23 13:11:27  NYCT_1202
5 2017-08-23 13:51:35  NYCT_1202  <-This Row
6 2017-08-23 16:12:27  NYCT_1202
7 2017-08-23 16:52:25  NYCT_1202  <-This Row
8 2017-08-07 09:33:42  NYCT_1206
9 2017-08-07 10:13:36  NYCT_1206  <-This Row

If the bus never looped back around, I could simply use df.groupby(by=['VehicleRef','RecordedAtTime']).last to select the last time.

I also tried creating a TimeDelta column with df['TimeDelta']=df['RecordedAtTime'].diff so that I could apply the filter shown after this paragraph. However, diff does not put the difference between the times in rows 0 and 1 on row 0, which means I cannot select the rows I want by their time delta.

df.loc[lambda x: x['TimeDelta']>2]

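So which pandas feature can I use to solve this? Is there a better way to use diff, or should I approach the problem in a completely different way?

1 answer:

Answer 0: (score: 1)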

import datetime

import pandas as pd
from pandas import Timestamp  # used by the DataFrame literal in the question

# Approximate trip duration: a longer gap means the bus left and came back
trip_minutes = datetime.timedelta(minutes=90)

# Sort by time, then group by vehicle so differences never cross vehicles
df = df.sort_values('RecordedAtTime')
dfg = df.groupby('VehicleRef')

# Elapsed time since the previous record of the same vehicle
df['Elapsed'] = dfg['RecordedAtTime'].diff()

# A gap longer than the trip duration marks the first record after a return,
# i.e. the previous visit to the stop has ended
df['isEnd'] = df['Elapsed'] > trip_minutes

# The departure is the row just before each such gap: shift back by one
# within each vehicle (regrouped so the new isEnd column is picked up)
df['isStart'] = df.groupby('VehicleRef')['isEnd'].shift(-1)

# Select the departures; the last record of each vehicle has NaN here
# (nothing follows it), so compare against False to keep it
df[df['isStart'] != False]

Result:

       RecordedAtTime VehicleRef  Elapsed  isEnd isStart
9 2017-08-07 10:13:36  NYCT_1206 00:39:54  False     NaN
1 2017-08-23 10:11:48  NYCT_1202 00:10:29  False    True
5 2017-08-23 13:51:35  NYCT_1202 00:40:08  False    True
7 2017-08-23 16:52:25  NYCT_1202 00:39:58  False     NaN
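
The same idea can also be written by labelling each visit to the stop and taking the last record per visit. The following is only a sketch under assumptions, not part of the original answer: the 90-minute threshold is reused from above, and the names max_gap, new_visit, visit_id and departures are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'RecordedAtTime': pd.to_datetime([
        '2017-08-23 10:01:19', '2017-08-23 10:11:48', '2017-08-23 12:51:31',
        '2017-08-23 13:02:02', '2017-08-23 13:11:27', '2017-08-23 13:51:35',
        '2017-08-23 16:12:27', '2017-08-23 16:52:25', '2017-08-07 09:33:42',
        '2017-08-07 10:13:36',
    ]),
    'VehicleRef': ['NYCT_1202'] * 8 + ['NYCT_1206'] * 2,
})

# Assumed threshold: a gap longer than this starts a new visit to the stop
max_gap = pd.Timedelta(minutes=90)

df = df.sort_values(['VehicleRef', 'RecordedAtTime'])

# True on the first record after a long gap; the NaT on each vehicle's
# first record compares as False
new_visit = df.groupby('VehicleRef')['RecordedAtTime'].diff() > max_gap

# Running count of long gaps per vehicle gives every visit its own id
df['visit_id'] = new_visit.astype(int).groupby(df['VehicleRef']).cumsum()

# The last record of each (vehicle, visit) pair is taken as the departure
departures = (
    df.groupby(['VehicleRef', 'visit_id'])['RecordedAtTime']
      .max()
      .reset_index()
)
print(departures)
```

Because every record carries a visit_id, this variant avoids the NaN handling in the final selection step, and the same grouping could also yield the arrival time of each visit with min() if that is ever needed.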