我有一个数据框,其中包含时间标签列,卫星ID和网站ID。我的目标是将数据集分解为单个"轨道"每个"跟踪"是卫星和站点ID的独特组合。我可以使用标准pandas groupby功能并指定by=['site', 'sat']
轻松完成此操作。但另一个警告是,如果一个组内的时间间隔超过N分钟,那么时间间隔之后的数据应该成为一个新的跟踪"。
我的问题是计算我的(站点,坐标)组中连续行之间的时间点的最佳方法,确定timedelta何时大于N分钟,并创建一个新的组/轨道?
我想我可以使用diff()方法计算行之间的时间差。理想情况下,有一种方法可以将第三个键添加到我的groupby调用中,该调用封装了我使用的时间限制。
以下是一些示例代码,用于生成测试数据集并执行初始站点,分组。
import pandas as pd
import numpy as np
# Create first sample set.
N=10
A_times = pd.date_range('2016-01-01T00:00:00', periods=N, freq='1s')
A_data = np.arange(0, N)
A_site = ['X'] * N
A_sat = 12345
# Create second sample set over the same time span but with a different sat
N=5
B_times = pd.date_range('2016-01-01T00:00:00', periods=N, freq='1s')
B_data = np.arange(0, N)
B_site = ['X'] * N
B_sat = 3456
# Create a third sample set with a new site over the same time span but the
# same sat as the second set
N = 10
C_times = pd.date_range('2016-01-01T00:01:00', periods=N, freq='1s')
C_data = np.arange(0, N)
C_site = ['Y'] * N
C_sat = 3456
# Create a fourth sample set with the same sat and site as the third set but
# more than 20 minutes after the third set.
N = 5
D_times = pd.date_range('2016-01-01T01:00:00', periods=N, freq='1s')
D_data = np.arange(0, N)
D_site = ['Y'] * N
D_sat = 3456
# Build a data frame for each sample set
A = pd.DataFrame(index=A_times, data={'data': A_data, 'site' : A_site, 'sat' : A_sat})
B = pd.DataFrame(index=B_times, data={'data': B_data, 'site' : B_site, 'sat' : B_sat})
C = pd.DataFrame(index=C_times, data={'data': C_data, 'site' : C_site, 'sat' : C_sat})
D = pd.DataFrame(index=D_times, data={'data': D_data, 'site' : D_site, 'sat' : D_sat})
# mash them into one larger test data frame
test = pd.concat([A, B, C, D])
print(test)
my_groups = test.groupby(by = ['site', 'sat'])
for key, g in my_groups:
print(key)
print(g)
这是
的输出test =
data sat site
2016-01-01 00:00:00 0 12345 X
2016-01-01 00:00:01 1 12345 X
2016-01-01 00:00:02 2 12345 X
2016-01-01 00:00:03 3 12345 X
2016-01-01 00:00:04 4 12345 X
2016-01-01 00:00:05 5 12345 X
2016-01-01 00:00:06 6 12345 X
2016-01-01 00:00:07 7 12345 X
2016-01-01 00:00:08 8 12345 X
2016-01-01 00:00:09 9 12345 X
2016-01-01 00:00:00 0 3456 X
2016-01-01 00:00:01 1 3456 X
2016-01-01 00:00:02 2 3456 X
2016-01-01 00:00:03 3 3456 X
2016-01-01 00:00:04 4 3456 X
2016-01-01 00:01:00 0 3456 Y
2016-01-01 00:01:01 1 3456 Y
2016-01-01 00:01:02 2 3456 Y
2016-01-01 00:01:03 3 3456 Y
2016-01-01 00:01:04 4 3456 Y
2016-01-01 00:01:05 5 3456 Y
2016-01-01 00:01:06 6 3456 Y
2016-01-01 00:01:07 7 3456 Y
2016-01-01 00:01:08 8 3456 Y
2016-01-01 00:01:09 9 3456 Y
2016-01-01 01:00:00 0 3456 Y
2016-01-01 01:00:01 1 3456 Y
2016-01-01 01:00:02 2 3456 Y
2016-01-01 01:00:03 3 3456 Y
2016-01-01 01:00:04 4 3456 Y
和各个小组
('X', 3456)
data sat site
2016-01-01 00:00:00 0 3456 X
2016-01-01 00:00:01 1 3456 X
2016-01-01 00:00:02 2 3456 X
2016-01-01 00:00:03 3 3456 X
2016-01-01 00:00:04 4 3456 X
('X', 12345)
data sat site
2016-01-01 00:00:00 0 12345 X
2016-01-01 00:00:01 1 12345 X
2016-01-01 00:00:02 2 12345 X
2016-01-01 00:00:03 3 12345 X
2016-01-01 00:00:04 4 12345 X
2016-01-01 00:00:05 5 12345 X
2016-01-01 00:00:06 6 12345 X
2016-01-01 00:00:07 7 12345 X
2016-01-01 00:00:08 8 12345 X
2016-01-01 00:00:09 9 12345 X
('Y', 3456)
data sat site
2016-01-01 00:01:00 0 3456 Y
2016-01-01 00:01:01 1 3456 Y
2016-01-01 00:01:02 2 3456 Y
2016-01-01 00:01:03 3 3456 Y
2016-01-01 00:01:04 4 3456 Y
2016-01-01 00:01:05 5 3456 Y
2016-01-01 00:01:06 6 3456 Y
2016-01-01 00:01:07 7 3456 Y
2016-01-01 00:01:08 8 3456 Y
2016-01-01 00:01:09 9 3456 Y
2016-01-01 01:00:00 0 3456 Y
2016-01-01 01:00:01 1 3456 Y
2016-01-01 01:00:02 2 3456 Y
2016-01-01 01:00:03 3 3456 Y
2016-01-01 01:00:04 4 3456 Y
理想的行为是,由于数据中存在20分钟的差距,上面的第三组实际上应该分成两组,例如
('Y', 3456)
data sat site
2016-01-01 00:01:00 0 3456 Y
2016-01-01 00:01:01 1 3456 Y
2016-01-01 00:01:02 2 3456 Y
2016-01-01 00:01:03 3 3456 Y
2016-01-01 00:01:04 4 3456 Y
2016-01-01 00:01:05 5 3456 Y
2016-01-01 00:01:06 6 3456 Y
2016-01-01 00:01:07 7 3456 Y
2016-01-01 00:01:08 8 3456 Y
2016-01-01 00:01:09 9 3456 Y
New group here
2016-01-01 01:00:00 0 3456 Y
2016-01-01 01:00:01 1 3456 Y
2016-01-01 01:00:02 2 3456 Y
2016-01-01 01:00:03 3 3456 Y
2016-01-01 01:00:04 4 3456 Y
任何建议将不胜感激。谢谢!
答案 0 :(得分:1)
在摆弄unutbu的解决方案并将其与this post的建议相结合后,我能够解决问题。更完整的示例测试集和解决方案如下所示。
# In[222]:
import pandas as pd
import numpy as np
# In[223]:
# Create 1st sample set.
N=10
A_times = pd.date_range('2016-01-01T00:00:00', periods=N, freq='1s')
A_data = np.arange(0, N)
A_site = ['X'] * N
A_sat = 12345
# Create 2nd sample set over the same time span but with a different sat
N=5
B_times = pd.date_range('2016-01-01T00:00:00', periods=N, freq='1s')
B_data = np.arange(0, N)
B_site = ['X'] * N
B_sat = 3456
# Create a 3rd sample set with a new site over the same time span but the
# same sat as the 2nd set
N = 10
C_times = pd.date_range('2016-01-01T00:01:00', periods=N, freq='1s')
C_data = np.arange(0, N)
C_site = ['Y'] * N
C_sat = 3456
# Create a 4th sample set with the same sat and site as the 3rd set but
# more than 20 minutes after the third set.
N = 5
D_times = pd.date_range('2016-01-01T01:00:00', periods=N, freq='1s')
D_data = np.arange(0, N)
D_site = ['Y'] * N
D_sat = 3456
# Create a 5th sample set with the same sat and site as the 3rd set but
# more than 20 minutes after the third set.
N = 60
E_times = pd.date_range('2016-01-01T00:00:00', periods=N, freq='60s')
E_data = np.arange(0, N)
E_site = ['Z'] * N
E_sat = 3456
# Create a 6th sample set with the same sat and site as the 4th set but
# more than 20 minutes after the fourth set.
N = 5
F_times = pd.date_range('2016-01-02T00:00:00', periods=N, freq='60s')
F_data = np.arange(0, N)
F_site = ['Y'] * N
F_sat = 3456
# In[224]:
# Build a data frame for each sample set
A = pd.DataFrame(data={'time': A_times, 'data': A_data, 'site' : A_site, 'sat' : A_sat})
B = pd.DataFrame(data={'time': B_times, 'data': B_data, 'site' : B_site, 'sat' : B_sat})
C = pd.DataFrame(data={'time': C_times, 'data': C_data, 'site' : C_site, 'sat' : C_sat})
D = pd.DataFrame(data={'time': D_times, 'data': D_data, 'site' : D_site, 'sat' : D_sat})
E = pd.DataFrame(data={'time': E_times, 'data': E_data, 'site' : E_site, 'sat' : E_sat})
F = pd.DataFrame(data={'time': F_times, 'data': F_data, 'site' : F_site, 'sat' : F_sat})
# mash them into one larger test data frame
test = pd.concat([A, B, C, D, E, F])
# In[225]:
print(test)
# In[226]:
test.sort_values(['time'], inplace=True)
# In[227]:
# This approach doesn't quite work. Note that the group (Y, 3456, 0) really
# has 2 tracks in it because the overlapping track from (Z, 3456, 0) is screwing up
# the delta-t calculation and hiding the fact that within the Y, 3456 group there
# was a large time gap.
test1 = test.copy()
test1['delta_t'] = test1['time'].diff()
test1['track'] = (test1['delta_t'] > pd.Timedelta(minutes=20)).cumsum()
my_groups = test1.groupby(by = ['site', 'sat', 'track'])
for key, g in my_groups:
print(key)
print(g)
这个输出看起来像这样:
('X', 3456, 0)
data sat site time delta_t track
0 0 3456 X 2016-01-01 00:00:00 0 days 0
1 1 3456 X 2016-01-01 00:00:01 0 days 0
2 2 3456 X 2016-01-01 00:00:02 0 days 0
3 3 3456 X 2016-01-01 00:00:03 0 days 0
4 4 3456 X 2016-01-01 00:00:04 0 days 0
('X', 12345, 0)
data sat site time delta_t track
0 0 12345 X 2016-01-01 00:00:00 NaT 0
1 1 12345 X 2016-01-01 00:00:01 00:00:01 0
2 2 12345 X 2016-01-01 00:00:02 00:00:01 0
3 3 12345 X 2016-01-01 00:00:03 00:00:01 0
4 4 12345 X 2016-01-01 00:00:04 00:00:01 0
5 5 12345 X 2016-01-01 00:00:05 00:00:01 0
6 6 12345 X 2016-01-01 00:00:06 00:00:01 0
7 7 12345 X 2016-01-01 00:00:07 00:00:01 0
8 8 12345 X 2016-01-01 00:00:08 00:00:01 0
9 9 12345 X 2016-01-01 00:00:09 00:00:01 0
('Y', 3456, 0)
data sat site time delta_t track
0 0 3456 Y 2016-01-01 00:01:00 00:00:00 0
1 1 3456 Y 2016-01-01 00:01:01 00:00:01 0
2 2 3456 Y 2016-01-01 00:01:02 00:00:01 0
3 3 3456 Y 2016-01-01 00:01:03 00:00:01 0
4 4 3456 Y 2016-01-01 00:01:04 00:00:01 0
5 5 3456 Y 2016-01-01 00:01:05 00:00:01 0
6 6 3456 Y 2016-01-01 00:01:06 00:00:01 0
7 7 3456 Y 2016-01-01 00:01:07 00:00:01 0
8 8 3456 Y 2016-01-01 00:01:08 00:00:01 0
9 9 3456 Y 2016-01-01 00:01:09 00:00:01 0
0 0 3456 Y 2016-01-01 01:00:00 00:01:00 0
1 1 3456 Y 2016-01-01 01:00:01 00:00:01 0
2 2 3456 Y 2016-01-01 01:00:02 00:00:01 0
3 3 3456 Y 2016-01-01 01:00:03 00:00:01 0
4 4 3456 Y 2016-01-01 01:00:04 00:00:01 0
('Y', 3456, 1)
data sat site time delta_t track
0 0 3456 Y 2016-01-02 00:00:00 22:59:56 1
1 1 3456 Y 2016-01-02 00:01:00 00:01:00 1
2 2 3456 Y 2016-01-02 00:02:00 00:01:00 1
3 3 3456 Y 2016-01-02 00:03:00 00:01:00 1
4 4 3456 Y 2016-01-02 00:04:00 00:01:00 1
('Z', 3456, 0)
data sat site time delta_t track
0 0 3456 Z 2016-01-01 00:00:00 00:00:00 0
1 1 3456 Z 2016-01-01 00:01:00 00:00:51 0
2 2 3456 Z 2016-01-01 00:02:00 00:00:51 0
3 3 3456 Z 2016-01-01 00:03:00 00:01:00 0
4 4 3456 Z 2016-01-01 00:04:00 00:01:00 0
5 5 3456 Z 2016-01-01 00:05:00 00:01:00 0
6 6 3456 Z 2016-01-01 00:06:00 00:01:00 0
7 7 3456 Z 2016-01-01 00:07:00 00:01:00 0
8 8 3456 Z 2016-01-01 00:08:00 00:01:00 0
9 9 3456 Z 2016-01-01 00:09:00 00:01:00 0
10 10 3456 Z 2016-01-01 00:10:00 00:01:00 0
11 11 3456 Z 2016-01-01 00:11:00 00:01:00 0
12 12 3456 Z 2016-01-01 00:12:00 00:01:00 0
13 13 3456 Z 2016-01-01 00:13:00 00:01:00 0
14 14 3456 Z 2016-01-01 00:14:00 00:01:00 0
15 15 3456 Z 2016-01-01 00:15:00 00:01:00 0
16 16 3456 Z 2016-01-01 00:16:00 00:01:00 0
17 17 3456 Z 2016-01-01 00:17:00 00:01:00 0
18 18 3456 Z 2016-01-01 00:18:00 00:01:00 0
19 19 3456 Z 2016-01-01 00:19:00 00:01:00 0
20 20 3456 Z 2016-01-01 00:20:00 00:01:00 0
21 21 3456 Z 2016-01-01 00:21:00 00:01:00 0
22 22 3456 Z 2016-01-01 00:22:00 00:01:00 0
23 23 3456 Z 2016-01-01 00:23:00 00:01:00 0
24 24 3456 Z 2016-01-01 00:24:00 00:01:00 0
25 25 3456 Z 2016-01-01 00:25:00 00:01:00 0
26 26 3456 Z 2016-01-01 00:26:00 00:01:00 0
27 27 3456 Z 2016-01-01 00:27:00 00:01:00 0
28 28 3456 Z 2016-01-01 00:28:00 00:01:00 0
29 29 3456 Z 2016-01-01 00:29:00 00:01:00 0
30 30 3456 Z 2016-01-01 00:30:00 00:01:00 0
31 31 3456 Z 2016-01-01 00:31:00 00:01:00 0
32 32 3456 Z 2016-01-01 00:32:00 00:01:00 0
33 33 3456 Z 2016-01-01 00:33:00 00:01:00 0
34 34 3456 Z 2016-01-01 00:34:00 00:01:00 0
35 35 3456 Z 2016-01-01 00:35:00 00:01:00 0
36 36 3456 Z 2016-01-01 00:36:00 00:01:00 0
37 37 3456 Z 2016-01-01 00:37:00 00:01:00 0
38 38 3456 Z 2016-01-01 00:38:00 00:01:00 0
39 39 3456 Z 2016-01-01 00:39:00 00:01:00 0
40 40 3456 Z 2016-01-01 00:40:00 00:01:00 0
41 41 3456 Z 2016-01-01 00:41:00 00:01:00 0
42 42 3456 Z 2016-01-01 00:42:00 00:01:00 0
43 43 3456 Z 2016-01-01 00:43:00 00:01:00 0
44 44 3456 Z 2016-01-01 00:44:00 00:01:00 0
45 45 3456 Z 2016-01-01 00:45:00 00:01:00 0
46 46 3456 Z 2016-01-01 00:46:00 00:01:00 0
47 47 3456 Z 2016-01-01 00:47:00 00:01:00 0
48 48 3456 Z 2016-01-01 00:48:00 00:01:00 0
49 49 3456 Z 2016-01-01 00:49:00 00:01:00 0
50 50 3456 Z 2016-01-01 00:50:00 00:01:00 0
51 51 3456 Z 2016-01-01 00:51:00 00:01:00 0
52 52 3456 Z 2016-01-01 00:52:00 00:01:00 0
53 53 3456 Z 2016-01-01 00:53:00 00:01:00 0
54 54 3456 Z 2016-01-01 00:54:00 00:01:00 0
55 55 3456 Z 2016-01-01 00:55:00 00:01:00 0
56 56 3456 Z 2016-01-01 00:56:00 00:01:00 0
57 57 3456 Z 2016-01-01 00:57:00 00:01:00 0
58 58 3456 Z 2016-01-01 00:58:00 00:01:00 0
59 59 3456 Z 2016-01-01 00:59:00 00:01:00 0
请注意,('Y',3456,0)的组中确实有两个轨道。所以不是一个完整的解决方案。继续我尝试了这个
# In[228]:
# This method works. The difference is that when I calculated the
# delta_t I did it on the results of the groupby.
# There's a undesireable effect that the track counter doesn't reset with each new
# (site, sat) pair. It appears to keep counting up.
test2 = test.copy()
test2.sort_values(['site', 'sat', 'time'], inplace=True)
test2['delta_t'] = test2.groupby(['site', 'sat'])['time'].diff()
test2['track'] = (test2['delta_t'] > pd.Timedelta(minutes=20)).cumsum()
my_groups = test2.groupby(by = ['site', 'sat', 'track'])
for key, g in my_groups:
print(key)
print(g)
有输出
('X', 3456, 0)
data sat site time delta_t track
0 0 3456 X 2016-01-01 00:00:00 NaT 0
1 1 3456 X 2016-01-01 00:00:01 00:00:01 0
2 2 3456 X 2016-01-01 00:00:02 00:00:01 0
3 3 3456 X 2016-01-01 00:00:03 00:00:01 0
4 4 3456 X 2016-01-01 00:00:04 00:00:01 0
('X', 12345, 0)
data sat site time delta_t track
0 0 12345 X 2016-01-01 00:00:00 NaT 0
1 1 12345 X 2016-01-01 00:00:01 00:00:01 0
2 2 12345 X 2016-01-01 00:00:02 00:00:01 0
3 3 12345 X 2016-01-01 00:00:03 00:00:01 0
4 4 12345 X 2016-01-01 00:00:04 00:00:01 0
5 5 12345 X 2016-01-01 00:00:05 00:00:01 0
6 6 12345 X 2016-01-01 00:00:06 00:00:01 0
7 7 12345 X 2016-01-01 00:00:07 00:00:01 0
8 8 12345 X 2016-01-01 00:00:08 00:00:01 0
9 9 12345 X 2016-01-01 00:00:09 00:00:01 0
('Y', 3456, 0)
data sat site time delta_t track
0 0 3456 Y 2016-01-01 00:01:00 NaT 0
1 1 3456 Y 2016-01-01 00:01:01 00:00:01 0
2 2 3456 Y 2016-01-01 00:01:02 00:00:01 0
3 3 3456 Y 2016-01-01 00:01:03 00:00:01 0
4 4 3456 Y 2016-01-01 00:01:04 00:00:01 0
5 5 3456 Y 2016-01-01 00:01:05 00:00:01 0
6 6 3456 Y 2016-01-01 00:01:06 00:00:01 0
7 7 3456 Y 2016-01-01 00:01:07 00:00:01 0
8 8 3456 Y 2016-01-01 00:01:08 00:00:01 0
9 9 3456 Y 2016-01-01 00:01:09 00:00:01 0
('Y', 3456, 1)
data sat site time delta_t track
0 0 3456 Y 2016-01-01 01:00:00 00:58:51 1
1 1 3456 Y 2016-01-01 01:00:01 00:00:01 1
2 2 3456 Y 2016-01-01 01:00:02 00:00:01 1
3 3 3456 Y 2016-01-01 01:00:03 00:00:01 1
4 4 3456 Y 2016-01-01 01:00:04 00:00:01 1
('Y', 3456, 2)
data sat site time delta_t track
0 0 3456 Y 2016-01-02 00:00:00 22:59:56 2
1 1 3456 Y 2016-01-02 00:01:00 00:01:00 2
2 2 3456 Y 2016-01-02 00:02:00 00:01:00 2
3 3 3456 Y 2016-01-02 00:03:00 00:01:00 2
4 4 3456 Y 2016-01-02 00:04:00 00:01:00 2
('Z', 3456, 2)
data sat site time delta_t track
0 0 3456 Z 2016-01-01 00:00:00 NaT 2
1 1 3456 Z 2016-01-01 00:01:00 00:01:00 2
2 2 3456 Z 2016-01-01 00:02:00 00:01:00 2
3 3 3456 Z 2016-01-01 00:03:00 00:01:00 2
4 4 3456 Z 2016-01-01 00:04:00 00:01:00 2
5 5 3456 Z 2016-01-01 00:05:00 00:01:00 2
6 6 3456 Z 2016-01-01 00:06:00 00:01:00 2
7 7 3456 Z 2016-01-01 00:07:00 00:01:00 2
8 8 3456 Z 2016-01-01 00:08:00 00:01:00 2
9 9 3456 Z 2016-01-01 00:09:00 00:01:00 2
10 10 3456 Z 2016-01-01 00:10:00 00:01:00 2
11 11 3456 Z 2016-01-01 00:11:00 00:01:00 2
12 12 3456 Z 2016-01-01 00:12:00 00:01:00 2
13 13 3456 Z 2016-01-01 00:13:00 00:01:00 2
14 14 3456 Z 2016-01-01 00:14:00 00:01:00 2
15 15 3456 Z 2016-01-01 00:15:00 00:01:00 2
16 16 3456 Z 2016-01-01 00:16:00 00:01:00 2
17 17 3456 Z 2016-01-01 00:17:00 00:01:00 2
18 18 3456 Z 2016-01-01 00:18:00 00:01:00 2
19 19 3456 Z 2016-01-01 00:19:00 00:01:00 2
20 20 3456 Z 2016-01-01 00:20:00 00:01:00 2
21 21 3456 Z 2016-01-01 00:21:00 00:01:00 2
22 22 3456 Z 2016-01-01 00:22:00 00:01:00 2
23 23 3456 Z 2016-01-01 00:23:00 00:01:00 2
24 24 3456 Z 2016-01-01 00:24:00 00:01:00 2
25 25 3456 Z 2016-01-01 00:25:00 00:01:00 2
26 26 3456 Z 2016-01-01 00:26:00 00:01:00 2
27 27 3456 Z 2016-01-01 00:27:00 00:01:00 2
28 28 3456 Z 2016-01-01 00:28:00 00:01:00 2
29 29 3456 Z 2016-01-01 00:29:00 00:01:00 2
30 30 3456 Z 2016-01-01 00:30:00 00:01:00 2
31 31 3456 Z 2016-01-01 00:31:00 00:01:00 2
32 32 3456 Z 2016-01-01 00:32:00 00:01:00 2
33 33 3456 Z 2016-01-01 00:33:00 00:01:00 2
34 34 3456 Z 2016-01-01 00:34:00 00:01:00 2
35 35 3456 Z 2016-01-01 00:35:00 00:01:00 2
36 36 3456 Z 2016-01-01 00:36:00 00:01:00 2
37 37 3456 Z 2016-01-01 00:37:00 00:01:00 2
38 38 3456 Z 2016-01-01 00:38:00 00:01:00 2
39 39 3456 Z 2016-01-01 00:39:00 00:01:00 2
40 40 3456 Z 2016-01-01 00:40:00 00:01:00 2
41 41 3456 Z 2016-01-01 00:41:00 00:01:00 2
42 42 3456 Z 2016-01-01 00:42:00 00:01:00 2
43 43 3456 Z 2016-01-01 00:43:00 00:01:00 2
44 44 3456 Z 2016-01-01 00:44:00 00:01:00 2
45 45 3456 Z 2016-01-01 00:45:00 00:01:00 2
46 46 3456 Z 2016-01-01 00:46:00 00:01:00 2
47 47 3456 Z 2016-01-01 00:47:00 00:01:00 2
48 48 3456 Z 2016-01-01 00:48:00 00:01:00 2
49 49 3456 Z 2016-01-01 00:49:00 00:01:00 2
50 50 3456 Z 2016-01-01 00:50:00 00:01:00 2
51 51 3456 Z 2016-01-01 00:51:00 00:01:00 2
52 52 3456 Z 2016-01-01 00:52:00 00:01:00 2
53 53 3456 Z 2016-01-01 00:53:00 00:01:00 2
54 54 3456 Z 2016-01-01 00:54:00 00:01:00 2
55 55 3456 Z 2016-01-01 00:55:00 00:01:00 2
56 56 3456 Z 2016-01-01 00:56:00 00:01:00 2
57 57 3456 Z 2016-01-01 00:57:00 00:01:00 2
58 58 3456 Z 2016-01-01 00:58:00 00:01:00 2
59 59 3456 Z 2016-01-01 00:59:00 00:01:00 2
赛道分解是正确的,但如果每个新的(站点,坐位)对的赛道计数器重置为0,那将是很好的。
# In[229]:
# This method works. The difference is that when I calculate the
# "track" counter I'm doing the cumulative sum on the results of
# the groupby. It also resets the track counter with each new
# site, sat group.
test3 = test.copy()
test3.sort_values(['site', 'sat', 'time'], inplace=True)
test3['delta_t'] = test3.groupby(['site', 'sat'])['time'].diff()
# calculate an intermediate flag column. If you try to eliminate this
# and put the boolean test directly into the 'track' calculation pandas
# will throw a recursion error.
test3['new_track'] = test3['delta_t'] > pd.Timedelta(minutes=20)
# The to_numeric call is used to convert from a float to an integer.
test3['track'] = pd.to_numeric(test3.groupby(['site', 'sat'])['new_track'].cumsum(), downcast='integer')
my_groups = test3.groupby(by = ['site', 'sat', 'track'])
for key, g in my_groups:
print(key)
print(g)
其中输出如下
('X', 3456, 0)
data sat site time delta_t new_track track
0 0 3456 X 2016-01-01 00:00:00 NaT False 0
1 1 3456 X 2016-01-01 00:00:01 00:00:01 False 0
2 2 3456 X 2016-01-01 00:00:02 00:00:01 False 0
3 3 3456 X 2016-01-01 00:00:03 00:00:01 False 0
4 4 3456 X 2016-01-01 00:00:04 00:00:01 False 0
('X', 12345, 0)
data sat site time delta_t new_track track
0 0 12345 X 2016-01-01 00:00:00 NaT False 0
1 1 12345 X 2016-01-01 00:00:01 00:00:01 False 0
2 2 12345 X 2016-01-01 00:00:02 00:00:01 False 0
3 3 12345 X 2016-01-01 00:00:03 00:00:01 False 0
4 4 12345 X 2016-01-01 00:00:04 00:00:01 False 0
5 5 12345 X 2016-01-01 00:00:05 00:00:01 False 0
6 6 12345 X 2016-01-01 00:00:06 00:00:01 False 0
7 7 12345 X 2016-01-01 00:00:07 00:00:01 False 0
8 8 12345 X 2016-01-01 00:00:08 00:00:01 False 0
9 9 12345 X 2016-01-01 00:00:09 00:00:01 False 0
('Y', 3456, 0)
data sat site time delta_t new_track track
0 0 3456 Y 2016-01-01 00:01:00 NaT False 0
1 1 3456 Y 2016-01-01 00:01:01 00:00:01 False 0
2 2 3456 Y 2016-01-01 00:01:02 00:00:01 False 0
3 3 3456 Y 2016-01-01 00:01:03 00:00:01 False 0
4 4 3456 Y 2016-01-01 00:01:04 00:00:01 False 0
5 5 3456 Y 2016-01-01 00:01:05 00:00:01 False 0
6 6 3456 Y 2016-01-01 00:01:06 00:00:01 False 0
7 7 3456 Y 2016-01-01 00:01:07 00:00:01 False 0
8 8 3456 Y 2016-01-01 00:01:08 00:00:01 False 0
9 9 3456 Y 2016-01-01 00:01:09 00:00:01 False 0
('Y', 3456, 1)
data sat site time delta_t new_track track
0 0 3456 Y 2016-01-01 01:00:00 00:58:51 True 1
1 1 3456 Y 2016-01-01 01:00:01 00:00:01 False 1
2 2 3456 Y 2016-01-01 01:00:02 00:00:01 False 1
3 3 3456 Y 2016-01-01 01:00:03 00:00:01 False 1
4 4 3456 Y 2016-01-01 01:00:04 00:00:01 False 1
('Y', 3456, 2)
data sat site time delta_t new_track track
0 0 3456 Y 2016-01-02 00:00:00 22:59:56 True 2
1 1 3456 Y 2016-01-02 00:01:00 00:01:00 False 2
2 2 3456 Y 2016-01-02 00:02:00 00:01:00 False 2
3 3 3456 Y 2016-01-02 00:03:00 00:01:00 False 2
4 4 3456 Y 2016-01-02 00:04:00 00:01:00 False 2
('Z', 3456, 0)
data sat site time delta_t new_track track
0 0 3456 Z 2016-01-01 00:00:00 NaT False 0
1 1 3456 Z 2016-01-01 00:01:00 00:01:00 False 0
2 2 3456 Z 2016-01-01 00:02:00 00:01:00 False 0
3 3 3456 Z 2016-01-01 00:03:00 00:01:00 False 0
4 4 3456 Z 2016-01-01 00:04:00 00:01:00 False 0
5 5 3456 Z 2016-01-01 00:05:00 00:01:00 False 0
6 6 3456 Z 2016-01-01 00:06:00 00:01:00 False 0
7 7 3456 Z 2016-01-01 00:07:00 00:01:00 False 0
8 8 3456 Z 2016-01-01 00:08:00 00:01:00 False 0
9 9 3456 Z 2016-01-01 00:09:00 00:01:00 False 0
10 10 3456 Z 2016-01-01 00:10:00 00:01:00 False 0
11 11 3456 Z 2016-01-01 00:11:00 00:01:00 False 0
12 12 3456 Z 2016-01-01 00:12:00 00:01:00 False 0
13 13 3456 Z 2016-01-01 00:13:00 00:01:00 False 0
14 14 3456 Z 2016-01-01 00:14:00 00:01:00 False 0
15 15 3456 Z 2016-01-01 00:15:00 00:01:00 False 0
16 16 3456 Z 2016-01-01 00:16:00 00:01:00 False 0
17 17 3456 Z 2016-01-01 00:17:00 00:01:00 False 0
18 18 3456 Z 2016-01-01 00:18:00 00:01:00 False 0
19 19 3456 Z 2016-01-01 00:19:00 00:01:00 False 0
20 20 3456 Z 2016-01-01 00:20:00 00:01:00 False 0
21 21 3456 Z 2016-01-01 00:21:00 00:01:00 False 0
22 22 3456 Z 2016-01-01 00:22:00 00:01:00 False 0
23 23 3456 Z 2016-01-01 00:23:00 00:01:00 False 0
24 24 3456 Z 2016-01-01 00:24:00 00:01:00 False 0
25 25 3456 Z 2016-01-01 00:25:00 00:01:00 False 0
26 26 3456 Z 2016-01-01 00:26:00 00:01:00 False 0
27 27 3456 Z 2016-01-01 00:27:00 00:01:00 False 0
28 28 3456 Z 2016-01-01 00:28:00 00:01:00 False 0
29 29 3456 Z 2016-01-01 00:29:00 00:01:00 False 0
30 30 3456 Z 2016-01-01 00:30:00 00:01:00 False 0
31 31 3456 Z 2016-01-01 00:31:00 00:01:00 False 0
32 32 3456 Z 2016-01-01 00:32:00 00:01:00 False 0
33 33 3456 Z 2016-01-01 00:33:00 00:01:00 False 0
34 34 3456 Z 2016-01-01 00:34:00 00:01:00 False 0
35 35 3456 Z 2016-01-01 00:35:00 00:01:00 False 0
36 36 3456 Z 2016-01-01 00:36:00 00:01:00 False 0
37 37 3456 Z 2016-01-01 00:37:00 00:01:00 False 0
38 38 3456 Z 2016-01-01 00:38:00 00:01:00 False 0
39 39 3456 Z 2016-01-01 00:39:00 00:01:00 False 0
40 40 3456 Z 2016-01-01 00:40:00 00:01:00 False 0
41 41 3456 Z 2016-01-01 00:41:00 00:01:00 False 0
42 42 3456 Z 2016-01-01 00:42:00 00:01:00 False 0
43 43 3456 Z 2016-01-01 00:43:00 00:01:00 False 0
44 44 3456 Z 2016-01-01 00:44:00 00:01:00 False 0
45 45 3456 Z 2016-01-01 00:45:00 00:01:00 False 0
46 46 3456 Z 2016-01-01 00:46:00 00:01:00 False 0
47 47 3456 Z 2016-01-01 00:47:00 00:01:00 False 0
48 48 3456 Z 2016-01-01 00:48:00 00:01:00 False 0
49 49 3456 Z 2016-01-01 00:49:00 00:01:00 False 0
50 50 3456 Z 2016-01-01 00:50:00 00:01:00 False 0
51 51 3456 Z 2016-01-01 00:51:00 00:01:00 False 0
52 52 3456 Z 2016-01-01 00:52:00 00:01:00 False 0
53 53 3456 Z 2016-01-01 00:53:00 00:01:00 False 0
54 54 3456 Z 2016-01-01 00:54:00 00:01:00 False 0
55 55 3456 Z 2016-01-01 00:55:00 00:01:00 False 0
56 56 3456 Z 2016-01-01 00:56:00 00:01:00 False 0
57 57 3456 Z 2016-01-01 00:57:00 00:01:00 False 0
58 58 3456 Z 2016-01-01 00:58:00 00:01:00 False 0
59 59 3456 Z 2016-01-01 00:59:00 00:01:00 False 0
所以我觉得我现在有一个很好的解决方案。如果您尝试将new_track
测试合并到track
计算中,那么大熊猫会抛出递归错误,这很奇怪。