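The following dataset holds GPS timestamps for buses arriving at one particular bus stop. While a bus idles at the stop, its GPS transmitter keeps sending data at semi-regular intervals.
I am trying to compile the departure time of each bus from this stop. The complicating factor is that the same bus may repeat the route at intervals of roughly 2 hours.
In the dataframe below, if bus NYCT_1202 arrives at the stop at 10:01:19 in row 0 and stays there until 10:11:48 in row 1, then I want to select 10:11:48.
Similarly, about two hours later, when the same bus loops around and arrives at the stop again at 12:51:31 in row 2, it "idles" (perhaps it goes out of service) until 13:51:35 in row 5. I would like to select that last time, 13:51:35.
If the bus did not loop around, I could use df.groupby(by=['VehicleRef','RecordedAtTime']).last() to select the last time.
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({'RecordedAtTime': {0: Timestamp('2017-08-23 10:01:19'),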
1: Timestamp('2017-08-23 10:11:48'),
2: Timestamp('2017-08-23 12:51:31'),
3: Timestamp('2017-08-23 13:02:02'),
4: Timestamp('2017-08-23 13:11:27'),
5: Timestamp('2017-08-23 13:51:35'),
6: Timestamp('2017-08-23 16:12:27'),
7: Timestamp('2017-08-23 16:52:25'),
8: Timestamp('2017-08-07 09:33:42'),
9: Timestamp('2017-08-07 10:13:36')},
'VehicleRef': {0: 'NYCT_1202',
1: 'NYCT_1202',
2: 'NYCT_1202',
3: 'NYCT_1202',
4: 'NYCT_1202',
5: 'NYCT_1202',
6: 'NYCT_1202',
7: 'NYCT_1202',
8: 'NYCT_1206',
9: 'NYCT_1206'}})
       RecordedAtTime VehicleRef
0 2017-08-23 10:01:19  NYCT_1202
1 2017-08-23 10:11:48  NYCT_1202  <-This Row
2 2017-08-23 12:51:31  NYCT_1202
3 2017-08-23 13:02:02  NYCT_1202
4 2017-08-23 13:11:27  NYCT_1202
5 2017-08-23 13:51:35  NYCT_1202  <-This Row
6 2017-08-23 16:12:27  NYCT_1202
7 2017-08-23 16:52:25  NYCT_1202  <-This Row
8 2017-08-07 09:33:42  NYCT_1206
9 2017-08-07 10:13:36  NYCT_1206  <-This Row
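I also tried creating a TimeDelta column with df['TimeDelta']=df['RecordedAtTime'].diff() so that I could apply df.loc[lambda x: x['TimeDelta']>2]. However, diff does not produce a difference between the 0th and 1st times at row 0, which means I cannot select rows by their time delta.
So which pandas function can I use to solve this? Is there a better way to use diff, or should I approach the problem in an entirely different way?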
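To make the diff behavior described above concrete, here is a minimal sketch. The three-row `sample` frame is a hypothetical cut of the first rows of the data, and the two-hour threshold is an assumption taken from the "roughly 2 hours" loop time. It shows that `diff()` leaves the first row empty (NaT), because that row has no predecessor, and that the gap is compared against a `pd.Timedelta` rather than the bare integer 2.

```python
import pandas as pd

# Hypothetical sample: the first three rows of the data above.
sample = pd.DataFrame({
    "RecordedAtTime": pd.to_datetime([
        "2017-08-23 10:01:19",
        "2017-08-23 10:11:48",
        "2017-08-23 12:51:31",
    ]),
    "VehicleRef": ["NYCT_1202"] * 3,
})

# diff() subtracts the previous row from the current one, so the
# first row has nothing to subtract and ends up as NaT.
sample["TimeDelta"] = sample["RecordedAtTime"].diff()

# Compare against a Timedelta (assumed threshold: 2 hours), not an int.
long_gap = sample["TimeDelta"] > pd.Timedelta(hours=2)

print(sample.assign(LongGap=long_gap))
```

The row flagged by `long_gap` is the first ping of the next visit, so the departure is the row just before it.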
Answer 0 (score: 1)
import datetime
import pandas as pd
from pandas import Timestamp
# Approximate trip duration; a longer gap means the bus left and came back
trip_minutes = datetime.timedelta(minutes=90)
# Order by time so that diff() within each vehicle compares consecutive pings
df = df.sort_values('RecordedAtTime')
dfg = df.groupby('VehicleRef')
# Elapsed time since the previous ping of the same vehicle (NaT for its first ping)
df['Elapsed'] = dfg['RecordedAtTime'].diff()
# A gap longer than the trip duration marks the first ping of a new visit
df['isEnd'] = df['Elapsed'] > trip_minutes
# The departure is the row just before each new visit: shift the flag back within the group
df['isStart'] = dfg['isEnd'].shift(-1)
# Keep rows flagged True, plus each vehicle's last row, where the shift leaves NaN
df[df['isStart'] != False]
Result:
       RecordedAtTime VehicleRef  Elapsed  isEnd isStart
9 2017-08-07 10:13:36  NYCT_1206 00:39:54  False     NaN
1 2017-08-23 10:11:48  NYCT_1202 00:10:29  False    True
5 2017-08-23 13:51:35  NYCT_1202 00:40:08  False    True
7 2017-08-23 16:52:25  NYCT_1202 00:39:58  False     NaN
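A possible variation on the same idea, offered only as a sketch: instead of shifting a flag, number each visit of a vehicle and take the latest timestamp per visit. The names `MAX_IDLE`, `Visit` and `departures` are hypothetical, the 90-minute threshold simply mirrors `trip_minutes` above, and the approach assumes that the last ping of a visit is the departure.

```python
import pandas as pd

# Same data as in the question, rebuilt so the block runs on its own.
df = pd.DataFrame({
    "RecordedAtTime": pd.to_datetime([
        "2017-08-23 10:01:19", "2017-08-23 10:11:48",
        "2017-08-23 12:51:31", "2017-08-23 13:02:02",
        "2017-08-23 13:11:27", "2017-08-23 13:51:35",
        "2017-08-23 16:12:27", "2017-08-23 16:52:25",
        "2017-08-07 09:33:42", "2017-08-07 10:13:36",
    ]),
    "VehicleRef": ["NYCT_1202"] * 8 + ["NYCT_1206"] * 2,
})

# Assumption: a gap above this threshold separates two visits to the stop.
MAX_IDLE = pd.Timedelta(minutes=90)

df = df.sort_values(["VehicleRef", "RecordedAtTime"])

# Gap to the previous ping of the same vehicle (NaT for its first ping).
gap = df.groupby("VehicleRef")["RecordedAtTime"].diff()

# Each long gap starts a new visit; a running count per vehicle labels the visits.
df["Visit"] = (gap > MAX_IDLE).astype(int).groupby(df["VehicleRef"]).cumsum()

# The departure is taken to be the last ping of each (vehicle, visit) pair.
departures = (
    df.groupby(["VehicleRef", "Visit"])["RecordedAtTime"].max().reset_index()
)
print(departures)
```

This returns one row per visit without relying on the `!= False` comparison to catch the NaN rows, at the cost of an extra helper column.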