I have the following code, which essentially tries to find groups of events with the same id that occur close together in time:
#!/usr/bin/env python3
import pandas as pd
import numpy as np
times = pd.date_range('1/1/2011', periods=72, freq='M')
times = times[(times < times[20]) | (times > times[40])]
df = pd.DataFrame({"value" : np.random.rand(len(times)), "times" : times, "id": np.random.randint(4, size=len(times))})
res = df.groupby("id").apply(lambda x: (x['times'].diff() > np.timedelta64(60, 'D')).astype('int').cumsum())
print(res)
The result looks something like this:
id
0 1 0
4 1
8 2
10 3
11 3
12 3
17 4
31 5
36 6
39 7
40 7
47 8
49 9
1 3 0
6 1
14 2
15 2
16 2
19 3
25 4
29 5
35 6
37 7
44 8
46 9
50 10
2 5 0
7 1
13 2
18 3
20 4
23 5
24 5
26 6
27 6
30 7
45 8
3 0 0
2 0
9 1
21 2
22 2
28 3
32 4
33 4
34 4
38 5
41 6
42 6
43 6
48 7
Name: times, dtype: int64
For example, here I know that events 41, 42, and 43 belong to the same group: they have the same id (3), and they are close to each other in time.
Now I would like to merge this result back into the original dataframe as a new column: how do I do that?
I have tried various combinations of apply
, reset_index
, etc., but I can't seem to make it work.
Answer 0 (score: 2)
IIUC, I think you want to use transform
instead. Also, consider adding np.random.seed(123) to your code and the expected output, so that the results can be verified.
import pandas as pd
import numpy as np
times = pd.date_range('1/1/2011', periods=72, freq='M')
times = times[(times < times[20]) | (times > times[40])]
df = pd.DataFrame({"value" : np.random.rand(len(times)), "times" : times, "id": np.random.randint(4, size=len(times))})
df['SameGroup'] = df.groupby("id")['times'].transform(lambda x: (x.diff() > np.timedelta64(60, 'D')).astype('int').cumsum())
print(df.sort_values(['id','times']))
Output:
value times id SameGroup
1 0.991668 2011-02-28 0 0
4 0.526418 2011-05-31 0 1
11 0.102302 2011-12-31 0 2
15 0.196234 2012-04-30 0 3
23 0.121400 2014-09-30 0 4
26 0.657766 2014-12-31 0 5
31 0.009018 2015-05-31 0 6
32 0.885023 2015-06-30 0 6
33 0.770459 2015-07-31 0 6
36 0.233050 2015-10-31 0 7
43 0.345321 2016-05-31 0 8
44 0.576960 2016-06-30 0 8
47 0.946987 2016-09-30 0 9
49 0.441697 2016-11-30 0 10
5 0.919395 2011-06-30 1 0
8 0.771437 2011-09-30 1 1
10 0.668462 2011-11-30 1 2
16 0.418372 2012-05-31 1 3
19 0.140115 2012-08-31 1 4
20 0.398020 2014-06-30 1 5
22 0.419557 2014-08-31 1 6
28 0.466919 2015-02-28 1 7
38 0.329871 2015-12-31 1 8
39 0.941279 2016-01-31 1 8
40 0.826048 2016-02-29 1 8
45 0.860163 2016-07-31 1 9
0 0.767486 2011-01-31 2 0
3 0.935697 2011-04-30 2 1
6 0.354937 2011-07-31 2 2
7 0.910906 2011-08-31 2 2
9 0.577648 2011-10-31 2 3
12 0.998919 2012-01-31 2 4
17 0.447130 2012-06-30 2 5
24 0.101906 2014-10-31 2 6
30 0.364872 2015-04-30 2 7
34 0.101173 2015-08-31 2 8
42 0.300244 2016-04-30 2 9
46 0.100143 2016-08-31 2 10
50 0.207622 2016-12-31 2 11
2 0.582782 2011-03-31 3 0
13 0.919462 2012-02-29 3 1
14 0.993302 2012-03-31 3 1
18 0.009203 2012-07-31 3 2
21 0.192862 2014-07-31 3 3
25 0.686448 2014-11-30 3 4
27 0.493378 2015-01-31 3 5
29 0.104054 2015-03-31 3 5
35 0.082092 2015-09-30 3 6
37 0.321680 2015-11-30 3 7
41 0.042734 2016-03-31 3 8
48 0.124706 2016-10-31 3 9
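If you prefer to keep the apply call from the question, another option is to drop the group-key level that apply prepends to the result's MultiIndex; the remaining index then aligns with the original dataframe, so plain assignment works. A minimal sketch under that assumption (the seed and column name `SameGroup` follow the answer above):

```python
import pandas as pd
import numpy as np

# Reproduce the question's setup, seeded as the answer suggests.
np.random.seed(123)
times = pd.date_range('1/1/2011', periods=72, freq='M')
times = times[(times < times[20]) | (times > times[40])]
df = pd.DataFrame({"value": np.random.rand(len(times)),
                   "times": times,
                   "id": np.random.randint(4, size=len(times))})

# Same per-group counter as the question, but computed on the 'times'
# column directly; the result is indexed by (id, original row index).
res = df.groupby("id")['times'].apply(
    lambda x: (x.diff() > np.timedelta64(60, 'D')).astype('int').cumsum())

# Drop the 'id' level so the index matches df again, then assign back.
df['SameGroup'] = res.reset_index(level=0, drop=True)
```

This produces the same column as the transform approach; transform is still the more direct tool, since it guarantees the output is aligned to the caller's index by construction.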