我有像这样的熊猫数据框
2011-5-5 12:43 noEvent CarA otherColumns...
2011-5-5 12:45 noEvent CarA ...
2011-5-5 12:49 EVENT CarA ...
2011-5-5 12:51 noEvent CarA ...
(no data - jumps in time)
2011-5-6 12:52 EVENT CarA ...
2011-5-6 12:59 noEvent CarA ...
2011-5-6 13:00 noEvent CarA ...
2011-5-5 12:43 noEvent CarB ...
2011-5-5 12:45 noEvent CarB ...
2011-5-5 12:49 noEvent CarB ...
2011-5-5 12:51 noEvent CarB ...
(no data - jumps in time)
2011-5-6 12:52 noEvent CarB ...
2011-5-6 12:52 EVENT CarB ...
2011-5-6 13:00 noEvent CarB ...
说明:
我需要执行一些计算 + - 发生事件前后2分钟以及每辆车。
要做到这一点,我很困惑......如何过滤这个数据框?
期望的结果将如下所示
-2min
2011-5-5 12:49 EVENT CarA ...
+2min
-2min
2011-5-6 12:52 EVENT CarA ...
+2min
-2min
2011-5-6 12:52 EVENT CarB ...
+2min
一些信息:
我不知道从哪里开始......
答案 0 :(得分:1)
首先按Car列分组并按如下方式处理每个组:
首先创建测试数据:
import pandas as pd
import numpy as np
np.random.seed(1)
idx = pd.date_range("2016-03-01 10:00:00", "2016-03-01 20:00:00", freq="S")
idx = idx[np.random.randint(0, len(idx), 10000)].sort_values()
evt = np.array(["no event", "event"])[(np.random.rand(len(idx)) < 0.0005).astype(int)]
df = pd.DataFrame({"event":evt, "value":np.random.randint(0, 10, len(evt))}, index=idx)
找到事件行和行索引+/- 10秒:
event_time = df.index[df.event == "event"]
delta = pd.Timedelta(10, unit="s")
start_idx = df.index.searchsorted(event_time - delta).tolist()
end_idx = df.index.searchsorted(event_time + delta).tolist()
创建掩码数组:
mask = np.zeros(df.shape[0], dtype=bool)
evt_id = np.zeros(df.shape[0], dtype=int)
for i, (s, e) in enumerate(zip(start_idx, end_idx)):
mask[s:e] = True
evt_id[s:e] = i
使用mask数组来过滤行,这里我创建了一个event_id列来分组事件:
df_event = df[mask]
df_event["event_id"] = evt_id[mask]
输出:
event value event_id
2016-03-01 13:51:48 no event 0 0
2016-03-01 13:51:51 event 8 0
2016-03-01 13:51:53 no event 3 0
2016-03-01 13:52:00 no event 1 0
2016-03-01 14:21:00 no event 2 1
2016-03-01 14:21:00 no event 5 1
2016-03-01 14:21:00 no event 0 1
2016-03-01 14:21:02 no event 1 1
2016-03-01 14:21:04 no event 2 1
2016-03-01 14:21:06 no event 0 1
2016-03-01 14:21:07 event 1 1
2016-03-01 14:21:16 no event 1 1
2016-03-01 14:21:16 no event 9 1
2016-03-01 15:09:42 no event 1 2
2016-03-01 15:09:49 event 7 2
2016-03-01 15:09:54 no event 3 2
2016-03-01 15:09:55 no event 3 2
2016-03-01 15:09:58 no event 5 2
2016-03-01 15:09:58 no event 9 2
2016-03-01 17:36:44 no event 8 3
2016-03-01 17:36:44 no event 2 3
2016-03-01 17:36:44 no event 9 3
2016-03-01 17:36:45 no event 2 3
2016-03-01 17:36:49 event 9 3
2016-03-01 17:36:50 no event 6 3
2016-03-01 17:36:54 no event 1 3
2016-03-01 17:36:56 no event 1 3
2016-03-01 18:51:37 no event 5 4
2016-03-01 18:51:37 no event 3 4
2016-03-01 18:51:42 no event 0 4
2016-03-01 18:51:47 event 9 4
2016-03-01 18:51:55 no event 4 4
答案 1 :(得分:1)
考虑交叉连接合并,比较所有事件过滤的数据帧和完整的数据帧。然后由同一辆汽车落在+/- 2分钟内的子集记录:
数据框设置(示例发布数据)
import pandas as pd
import datetime
df = pd.DataFrame({'Date': ['5/5/2011 12:43', '5/5/2011 12:45', '5/5/2011 12:49',
'5/5/2011 12:51', '5/6/2011 12:52', '5/6/2011 12:59',
'5/6/2011 13:00', '5/5/2011 12:43', '5/5/2011 12:45',
'5/5/2011 12:49', '5/5/2011 12:51', '5/6/2011 12:52',
'5/6/2011 12:52', '5/6/2011 13:00'],
'Event': ['noEvent', 'noEvent', 'EVENT', 'noEvent','EVENT',
'noEvent', 'noEvent', 'noEvent', 'noEvent', 'noEvent',
'noEvent', 'noEvent', 'EVENT', 'noEvent'],
'Car': ['CarA', 'CarA', 'CarA', 'CarA', 'CarA',
'CarA', 'CarA', 'CarB', 'CarB','CarB',
'CarB', 'CarB', 'CarB', 'CarB']})
df['Date'] = pd.to_datetime(df['Date'])
# Car Date Event
# 0 CarA 2011-05-05 12:43:00 noEvent
# 1 CarA 2011-05-05 12:45:00 noEvent
# 2 CarA 2011-05-05 12:49:00 EVENT
# 3 CarA 2011-05-05 12:51:00 noEvent
# 4 CarA 2011-05-06 12:52:00 EVENT
# 5 CarA 2011-05-06 12:59:00 noEvent
# 6 CarA 2011-05-06 13:00:00 noEvent
# 7 CarB 2011-05-05 12:43:00 noEvent
# 8 CarB 2011-05-05 12:45:00 noEvent
# 9 CarB 2011-05-05 12:49:00 noEvent
# 10 CarB 2011-05-05 12:51:00 noEvent
# 11 CarB 2011-05-06 12:52:00 noEvent
# 12 CarB 2011-05-06 12:52:00 EVENT
# 13 CarB 2011-05-06 13:00:00 noEvent
交叉连接(返回两个配对M X N之间的完整组合集)
df['key'] = 1
# EVENTS DF
eventsdf = df[df['Event']=='EVENT']
# CROSS JOIN DF
crossdf = pd.merge(df, eventsdf, on='key')
crossdf = crossdf[((crossdf['Date_x'] <= crossdf['Date_y']
+ datetime.timedelta(minutes=2)) &
(crossdf['Date_x'] >= crossdf['Date_y']
- datetime.timedelta(minutes=2))) &
(crossdf['Car_x'] == crossdf['Car_y'])].sort_values('Date_x')
finaldf = crossdf[['Car_x', 'Date_x', 'Event_x']].drop_duplicates().sort_values('Car_x')
finaldf.columns = ['Car', 'Date', 'Event']
# Car Date Event
# 6 CarA 2011-05-05 12:49:00 EVENT
# 9 CarA 2011-05-05 12:51:00 noEvent
# 13 CarA 2011-05-06 12:52:00 EVENT
# 35 CarB 2011-05-06 12:52:00 noEvent
# 38 CarB 2011-05-06 12:52:00 EVENT