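I have 1-minute time-step data in a pandas DataFrame. The data is not recorded continuously, and I want to split it into separate events under the following condition: a stretch of data counts as an event only if it is recorded continuously for 5 minutes or longer, and the data for each such event needs to be extracted separately. Is there a way to do this in a pandas DataFrame?
My data looks like this (the Event column is the desired result):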
Date X Event
2017-06-06 01:08:00 0.019 1
2017-06-06 01:09:00 0.005 1
2017-06-06 01:10:00 0.03 1
2017-06-06 01:11:00 0.005 1
2017-06-06 01:12:00 0.003 1
2017-06-06 01:13:00 0.001 1
2017-06-06 01:14:00 0.039 1
2017-06-06 01:15:00 0.003 1
2017-06-06 01:17:00 0.001 nan
2017-06-06 01:25:00 0.006 nan
2017-06-06 01:26:00 0.006 nan
2017-06-06 01:27:00 0.032 nan
2017-06-06 01:29:00 0.013 2
2017-06-06 01:30:00 0.065 2
2017-06-06 01:31:00 0.013 2
2017-06-06 01:32:00 0.001 2
2017-06-06 01:33:00 0.02 2
2017-06-06 01:38:00 0.05 nan
2017-06-06 01:40:00 0.025 3
2017-06-06 01:41:00 0.01 3
2017-06-06 01:42:00 0.008 3
2017-06-06 01:43:00 0.009 3
2017-06-06 01:44:00 0.038 3
2017-06-06 01:45:00 0.038 3
Thanks a lot for any suggestions.
Using the solution provided by nnnmmm, the result looks like this:
2015-01-01 03:24:00 NaN
2015-01-01 04:59:00 NaN
2015-01-01 05:01:00 NaN
2015-01-01 05:02:00 NaN
2015-01-01 05:03:00 NaN
2015-01-13 01:12:00 1.0
2015-01-13 01:13:00 1.0
2015-01-13 01:14:00 1.0
2015-01-13 01:15:00 1.0
2015-01-13 01:16:00 1.0
2015-01-13 01:49:00 1.0
2015-01-13 01:50:00 1.0
2015-01-13 01:51:00 1.0
2015-01-13 01:52:00 1.0
2015-01-13 01:53:00 1.0
2015-01-13 01:54:00 1.0
2015-01-13 01:55:00 1.0
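In this case there is a time gap between 01:16:00 and 01:49:00, so the two stretches should not be treated as the same event; 01:49:00 should start the second event.
Answer 0: (score: 5)
This is a bit rough (and not very concise), but you could do something like the following. You could of course shorten it and drop the intermediate variables, but I left them in to make it easier to see what is going on.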
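import numpy as np
import pandas as pd
# 1 where a row is exactly one minute after the previous row, else 0 (Date is the index)
df['new'] = (df.reset_index().Date.diff() == pd.Timedelta('1min')).astype(int).values
# a new group starts at every row that breaks the one-minute sequence
df['grp'] = (df.new != 1).cumsum()
# number of rows in each group; groups of 5 or more rows are events
df['cnt'] = df.groupby('grp')['new'].transform('size')
df['event'] = df['cnt'] > 4
# number the events in order, then blank out rows that belong to no event
df['Event'] = ((df.event) & (df.new != 1)).cumsum()
df['Event'] = np.where( df.event, df.Event, np.nan )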
X new grp cnt event Event
Date
2017-06-06 01:08:00 0.019 0 1 8 True 1.0
2017-06-06 01:09:00 0.005 1 1 8 True 1.0
2017-06-06 01:10:00 0.030 1 1 8 True 1.0
2017-06-06 01:11:00 0.005 1 1 8 True 1.0
2017-06-06 01:12:00 0.003 1 1 8 True 1.0
2017-06-06 01:13:00 0.001 1 1 8 True 1.0
2017-06-06 01:14:00 0.039 1 1 8 True 1.0
2017-06-06 01:15:00 0.003 1 1 8 True 1.0
2017-06-06 01:17:00 0.001 0 2 1 False NaN
2017-06-06 01:25:00 0.006 0 3 3 False NaN
2017-06-06 01:26:00 0.006 1 3 3 False NaN
2017-06-06 01:27:00 0.032 1 3 3 False NaN
2017-06-06 01:29:00 0.013 0 4 5 True 2.0
2017-06-06 01:30:00 0.065 1 4 5 True 2.0
2017-06-06 01:31:00 0.013 1 4 5 True 2.0
2017-06-06 01:32:00 0.001 1 4 5 True 2.0
2017-06-06 01:33:00 0.020 1 4 5 True 2.0
2017-06-06 01:38:00 0.050 0 5 1 False NaN
2017-06-06 01:40:00 0.025 0 6 6 True 3.0
2017-06-06 01:41:00 0.010 1 6 6 True 3.0
2017-06-06 01:42:00 0.008 1 6 6 True 3.0
2017-06-06 01:43:00 0.009 1 6 6 True 3.0
2017-06-06 01:44:00 0.038 1 6 6 True 3.0
2017-06-06 01:45:00 0.038 1 6 6 True 3.0
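For reference, the same steps can be wrapped in a small helper and followed by the per-event extraction the question asks for. This is only a sketch under some assumptions: the function name label_events, its min_len and step parameters, the placeholder X values, and the shortened list of timestamps are made up for illustration, and Date is assumed to be the index as in the output above.

```python
import numpy as np
import pandas as pd

# A shortened set of timestamps modelled on the question (X values are placeholders)
times = pd.to_datetime([
    "2017-06-06 01:08", "2017-06-06 01:09", "2017-06-06 01:10",
    "2017-06-06 01:11", "2017-06-06 01:12", "2017-06-06 01:17",
    "2017-06-06 01:25", "2017-06-06 01:26", "2017-06-06 01:29",
    "2017-06-06 01:30", "2017-06-06 01:31", "2017-06-06 01:32",
    "2017-06-06 01:33", "2017-06-06 01:38",
])
df = pd.DataFrame({"X": np.arange(len(times), dtype=float)}, index=times)
df.index.name = "Date"


def label_events(frame, min_len=5, step="1min"):
    """Return a Series of event numbers (NaN for rows outside any event)."""
    # A row starts a new run unless it is exactly `step` after the previous row
    starts_run = frame.index.to_series().diff() != pd.Timedelta(step)
    run_id = starts_run.cumsum()
    run_len = run_id.groupby(run_id).transform("size")
    is_event = run_len >= min_len
    # Number only the runs that qualify, in order of appearance
    event_no = (starts_run & is_event).cumsum()
    return event_no.where(is_event)


df["Event"] = label_events(df)
# One DataFrame per event, keyed by event number
events = {int(k): g for k, g in df.dropna(subset=["Event"]).groupby("Event")}
```

Each value in events is an ordinary DataFrame, so events[1] and events[2] can be processed or saved independently.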
Answer 1: (score: 3)
There may be more elegant ways to solve this, but this should do the trick:
# True where the previous timestamp is one minute away
prev_ok = pd.Series(df['Date'].diff().values == np.timedelta64(1, 'm'))
# True where this row and the three rows before it are all True in prev_ok
a = prev_ok.rolling(4).sum() == 4
# extend True back over the previous four rows; this could be done with a loop
b = a | a.shift(-1, fill_value=False) | a.shift(-2, fill_value=False) | a.shift(-3, fill_value=False) | a.shift(-4, fill_value=False)
# calculate edges from False to True to get the event indices
c = (~a.shift(-3, fill_value=False) & a.shift(-4, fill_value=False)).cumsum()
# only display event indices where b is True
df['Event'] = c.mask(~b)
Output
Date X Event
0 2017-06-06 01:08:00 0.019 1.0
1 2017-06-06 01:09:00 0.005 1.0
2 2017-06-06 01:10:00 0.030 1.0
3 2017-06-06 01:11:00 0.005 1.0
4 2017-06-06 01:12:00 0.003 1.0
5 2017-06-06 01:13:00 0.001 1.0
6 2017-06-06 01:14:00 0.039 1.0
7 2017-06-06 01:15:00 0.003 1.0
8 2017-06-06 01:17:00 0.001 NaN
9 2017-06-06 01:25:00 0.006 NaN
10 2017-06-06 01:26:00 0.006 NaN
11 2017-06-06 01:27:00 0.032 NaN
12 2017-06-06 01:29:00 0.013 2.0
13 2017-06-06 01:30:00 0.065 2.0
14 2017-06-06 01:31:00 0.013 2.0
15 2017-06-06 01:32:00 0.001 2.0
16 2017-06-06 01:33:00 0.020 2.0
17 2017-06-06 01:38:00 0.050 NaN
18 2017-06-06 01:40:00 0.025 NaN
19 2017-06-06 01:42:00 0.010 NaN
20 2017-06-06 01:43:00 0.008 NaN
21 2017-06-06 01:44:00 0.009 NaN
22 2017-06-06 01:45:00 0.038 NaN
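The remark that the shifts "could be done with a loop" can be taken one step further. Below is a rough sketch, not the answerer's code, that parameterises the same rolling/shift idea by the minimum event length. The name rolling_events and the parameters min_len and step are hypothetical, min_len is assumed to be at least 2, and shift(..., fill_value=False) is used so the masks stay boolean.

```python
import pandas as pd


def rolling_events(dates, min_len=5, step="1min"):
    """Label runs of at least min_len rows spaced exactly `step` apart."""
    # True where the previous timestamp is exactly one step away
    prev_ok = pd.Series(dates).diff().eq(pd.Timedelta(step)).reset_index(drop=True)
    n = min_len - 1
    # True on a row that closes a run of min_len consecutive rows
    a = prev_ok.astype(int).rolling(n).sum() == n
    # Extend True back over the n rows before each such row
    b = a.copy()
    for k in range(1, min_len):
        b = b | a.shift(-k, fill_value=False)
    # A run starts n rows before the first True of a block in `a`
    starts = a.shift(-n, fill_value=False) & ~a.shift(-(n - 1), fill_value=False)
    return starts.cumsum().where(b)


times = pd.to_datetime([
    "2017-06-06 01:08", "2017-06-06 01:09", "2017-06-06 01:10",
    "2017-06-06 01:11", "2017-06-06 01:12", "2017-06-06 01:17",
    "2017-06-06 01:25", "2017-06-06 01:26", "2017-06-06 01:29",
    "2017-06-06 01:30", "2017-06-06 01:31", "2017-06-06 01:32",
    "2017-06-06 01:33", "2017-06-06 01:38",
])
result = rolling_events(times)
```

With the default min_len=5 this follows the same logic as a, b and c above; other values of min_len only change the window length and the number of shifts.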
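Fun fact: the calculation of a is essentially an erosion of a 1-D image with a structuring element of length 4, and the calculation of b is a dilation with the same structuring element. Taken together, computing b from prev_ok is an opening, i.e. a True in prev_ok stays True in b only if it is part of a group of five consecutive True values.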
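To make the morphology remark concrete, here is a small pure-Python sketch of a 1-D erosion, dilation and opening on a list of booleans. The function names erode, dilate and opening and the sample mask are illustrative assumptions, and the window orientation (trailing for the erosion, extending backwards for the dilation) is chosen to mirror rolling(4) and the negative shifts. Note that b in the answer ORs five shifted copies of a (shifts 0 to -4), so it reaches one row further back than a strict opening of length 4 and thereby also covers the first row of each run, whose own prev_ok is False; in terms of rows, four consecutive True values in prev_ok correspond to five consecutive timestamps.

```python
def erode(mask, n):
    """Trailing-window erosion: True at i only if mask[i-n+1..i] are all True."""
    return [i >= n - 1 and all(mask[i - n + 1:i + 1]) for i in range(len(mask))]


def dilate(mask, n):
    """Backward dilation: True at i if any of mask[i..i+n-1] is True."""
    return [any(mask[i:i + n]) for i in range(len(mask))]


def opening(mask, n):
    """Erosion followed by dilation with the same window length."""
    return dilate(erode(mask, n), n)


# 1 = True, 0 = False; the mask contains runs of length 2, 4 and 5
prev_ok = [bool(v) for v in [0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1]]
opened = opening(prev_ok, 4)
```

In this example the run of length 2 is removed, while the runs of length 4 and 5 survive unchanged.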
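Answer 2: (score: 1)
I can give a similar example that you can adapt, without using any external libraries.
For example, as a stand-in for your time-interval data, take a simple list
x = [1,2,3,4,6,7,8,9,10,11,12,14,15,17,18,19,20,21,22,23,24,25,28,30,32,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
We will mark a value True if it is "consecutive" for at least 5 terms.
To do this, we first build the list of differences, assigning 1 where the difference is 1 and 0 otherwise. This is done by y = [int(round(1/(x[i]-x[i-1]))) for i in range(1, len(x))].
After that, we take the indices where the difference is 0 and use them to check your condition.
Full code: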
import copy

x = [1,2,3,4,6,7,8,9,10,11,12,14,15,17,18,19,20,21,22,23,24,25,28,30,32,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
# 1 where the step from the previous value is exactly 1, otherwise 0
y = [int(round(1/(x[i]-x[i-1]))) for i in range(1, len(x))]
z = copy.deepcopy(y)
# positions of the zeros in y (the breaks between runs), in ascending order
zeros_check = [abs(y[i]-1)*i for i in range(0, len(y))]
zeros_id = sorted(set(zeros_check))
zeros_id.remove(0)
zeros_id.append(len(y))
idx = 0
for i in zeros_id:
    if sum(y[idx+1:i]) >= 5:
        # long enough run: mark it True without changing the length of z
        z[idx+1:i] = [True] * (i - idx - 1)
    else:
        z[idx:i+1] = [False] * len(z[idx:i+1])
    idx = i
for i, j, k in zip(x, y, z):
    print(i, j, k)
Output
1 1 False
2 1 False
3 1 False
4 0 False
6 1 True
7 1 True
8 1 True
9 1 True
10 1 True
11 1 True
12 0 False
14 1 False
15 0 False
17 1 True
18 1 True
19 1 True
20 1 True
21 1 True
22 1 True
23 1 True
24 1 True
25 0 False
28 0 False
30 0 False
32 0 False
34 1 True
35 1 True
36 1 True
37 1 True
38 1 True
39 1 True
40 1 True
41 1 True
42 1 True
43 1 True
44 1 True
45 1 True
46 1 True
47 1 True
48 1 True
49 1 True
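A more compact way to express the same idea in plain Python is to group the values with itertools.groupby. This is an alternative sketch rather than a rewrite of the code above: the function name consecutive_runs and its min_len parameter are made up for illustration, the input is assumed to be sorted integers, and the threshold here counts the values in a run, which is slightly different from the sum(y[idx+1:i]) >= 5 test used in the answer.

```python
from itertools import groupby


def consecutive_runs(values, min_len=5):
    """Return the runs of consecutive integers that have at least min_len items."""
    runs = []
    # value - position is constant within a run of consecutive integers
    for _, grp in groupby(enumerate(values), key=lambda p: p[1] - p[0]):
        run = [v for _, v in grp]
        if len(run) >= min_len:
            runs.append(run)
    return runs


# First part of the list x used in the answer
x = [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 14, 15,
     17, 18, 19, 20, 21, 22, 23, 24, 25, 28]
runs = consecutive_runs(x)
```

Here runs holds two lists, 6 to 12 and 17 to 25; the shorter stretches 1 to 4, 14 to 15 and the single value 28 are dropped.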