Getting the result of a groupby / applying it back to the dataframe

Time: 2019-01-14 13:25:00

Tags: python pandas pandas-groupby

I have the following code, which basically tries to find groups of events that share the same id and are close to each other in time:

#!/usr/bin/env python3

import pandas as pd                                                             
import numpy as np                                                              

# 72 month-end timestamps, with a block removed in the middle to create a large gap
times = pd.date_range('1/1/2011', periods=72, freq='M')
times = times[(times < times[20]) | (times > times[40])]

df = pd.DataFrame({"value" : np.random.rand(len(times)), "times" : times, "id": np.random.randint(4, size=len(times))})
# within each id, start a new group whenever the gap to the previous event exceeds 60 days
res = df.groupby("id").apply(lambda x: (x['times'].diff() > np.timedelta64(60, 'D')).astype('int').cumsum())
print(res)

The result looks something like this:

id    
0   1      0
    4      1
    8      2
    10     3
    11     3
    12     3
    17     4
    31     5
    36     6
    39     7
    40     7
    47     8
    49     9
1   3      0
    6      1
    14     2
    15     2
    16     2
    19     3
    25     4
    29     5
    35     6
    37     7
    44     8
    46     9
    50    10
2   5      0
    7      1
    13     2
    18     3
    20     4
    23     5
    24     5
    26     6
    27     6
    30     7
    45     8
3   0      0
    2      0
    9      1
    21     2
    22     2
    28     3
    32     4
    33     4
    34     4
    38     5
    41     6
    42     6
    43     6
    48     7
Name: times, dtype: int64

For instance, here I know that events 41, 42 and 43 belong to the same group: they have the same id (0) and they are close to each other in time.

Now I would like to get this data back into the original dataframe as a new column: how can I do that?

I have tried various combinations of apply, reset_index, and so on, but I can't seem to make it work.

1 answer:

Answer 0 (score: 2):

IIUC, I think you want to use transform. Also, try adding np.random.seed(123) to the code and the expected output next time, so that the results can be verified.

import pandas as pd                                                             
import numpy as np                                                              

times = pd.date_range('1/1/2011', periods=72, freq='M')                         
times = times[(times < times[20]) | (times > times[40])]                        

df = pd.DataFrame({"value" : np.random.rand(len(times)), "times" : times, "id": np.random.randint(4, size=len(times))})
# transform returns a Series aligned with df's original index, so it can be assigned directly as a new column
df['SameGroup'] = df.groupby("id")['times'].transform(lambda x: (x.diff() > np.timedelta64(60, 'D')).astype('int').cumsum())
print(df.sort_values(['id','times']))

Output:

       value      times  id  SameGroup
1   0.991668 2011-02-28   0          0
4   0.526418 2011-05-31   0          1
11  0.102302 2011-12-31   0          2
15  0.196234 2012-04-30   0          3
23  0.121400 2014-09-30   0          4
26  0.657766 2014-12-31   0          5
31  0.009018 2015-05-31   0          6
32  0.885023 2015-06-30   0          6
33  0.770459 2015-07-31   0          6
36  0.233050 2015-10-31   0          7
43  0.345321 2016-05-31   0          8
44  0.576960 2016-06-30   0          8
47  0.946987 2016-09-30   0          9
49  0.441697 2016-11-30   0         10
5   0.919395 2011-06-30   1          0
8   0.771437 2011-09-30   1          1
10  0.668462 2011-11-30   1          2
16  0.418372 2012-05-31   1          3
19  0.140115 2012-08-31   1          4
20  0.398020 2014-06-30   1          5
22  0.419557 2014-08-31   1          6
28  0.466919 2015-02-28   1          7
38  0.329871 2015-12-31   1          8
39  0.941279 2016-01-31   1          8
40  0.826048 2016-02-29   1          8
45  0.860163 2016-07-31   1          9
0   0.767486 2011-01-31   2          0
3   0.935697 2011-04-30   2          1
6   0.354937 2011-07-31   2          2
7   0.910906 2011-08-31   2          2
9   0.577648 2011-10-31   2          3
12  0.998919 2012-01-31   2          4
17  0.447130 2012-06-30   2          5
24  0.101906 2014-10-31   2          6
30  0.364872 2015-04-30   2          7
34  0.101173 2015-08-31   2          8
42  0.300244 2016-04-30   2          9
46  0.100143 2016-08-31   2         10
50  0.207622 2016-12-31   2         11
2   0.582782 2011-03-31   3          0
13  0.919462 2012-02-29   3          1
14  0.993302 2012-03-31   3          1
18  0.009203 2012-07-31   3          2
21  0.192862 2014-07-31   3          3
25  0.686448 2014-11-30   3          4
27  0.493378 2015-01-31   3          5
29  0.104054 2015-03-31   3          5
35  0.082092 2015-09-30   3          6
37  0.321680 2015-11-30   3          7
41  0.042734 2016-03-31   3          8
48  0.124706 2016-10-31   3          9
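
For reference, the apply result from the question can also be aligned back onto the original dataframe: it comes back as a Series with a MultiIndex of (id, original row index), so dropping the id level leaves a Series indexed like df, which pandas aligns on assignment. A minimal sketch reusing the question's setup (the seed and the SameGroup column name are just illustrative):

import pandas as pd
import numpy as np

np.random.seed(123)  # only so that repeated runs are comparable

times = pd.date_range('1/1/2011', periods=72, freq='M')
times = times[(times < times[20]) | (times > times[40])]
df = pd.DataFrame({"value": np.random.rand(len(times)),
                   "times": times,
                   "id": np.random.randint(4, size=len(times))})

# apply() returns a Series with a MultiIndex of (id, original row index)
res = df.groupby("id").apply(
    lambda x: (x['times'].diff() > np.timedelta64(60, 'D')).astype('int').cumsum())

# dropping the 'id' level leaves a Series indexed like df,
# which pandas aligns on assignment when creating the new column
df['SameGroup'] = res.reset_index(level='id', drop=True)
print(df.head())

Either way, transform is the more direct tool here, since it is guaranteed to return a result indexed like the original frame.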