Question

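I am new to pandas and I am trying to build a cohort analysis. I need a column that holds the cumulative value over the previous periods within each cohort group. For example, for this DataFrame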

                                       CanceledCustomers
CohortGroup NewCustomers CancelPeriod                               
2016-05     75           2016-07                    2     
                         2016-08                    5     
                         2016-09                    6     
                         2016-10                    7     
                         2016-11                    6     
                         2016-12                   2     
                         2017-01                    5             
                         2017-02                    6              
                         2017-03                   1             
                         2017-04                    5             
                         2017-05                    6             
                         2017-06                    1          
2016-06     81           2016-07                    1              
                         2016-08                    3           
                         2016-09                    4              
                         2016-10                   1           
                         2016-11                    6              
                         2016-12                   2              
                         2017-01                    5              
                         2017-02                    3              
                         2017-03                   3             
                         2017-04                    4              
                         2017-05                    4             
                         2017-06                    4             
2016-07     139          2016-07                    1              
                         2016-08                    6              
                         2016-09                   4           
                         2016-10                   8           
                         2016-11                   13           
                         2016-12                   5

I would like to see output like this:

                                       CanceledCustomers     TotalCancCust      
CohortGroup NewCustomers CancelPeriod                               
2016-05     75           2016-07                    2              2
                         2016-08                    5              7
                         2016-09                    6              13
                         2016-10                    7              20
                         2016-11                    6              26
                         2016-12                   2               28
                         2017-01                    5              33
                         2017-02                    6              39
                         2017-03                   1               40
                         2017-04                    5              45
                         2017-05                    6              51
                         2017-06                    1              52
2016-06     81           2016-07                    1              1
                         2016-08                    3              4
                         2016-09                    4              8
                         2016-10                   1               9
                         2016-11                    6              15
                         2016-12                   2               17
                         2017-01                    5              22
                         2017-02                    3              25
                         2017-03                   3               28
                         2017-04                    4              32
                         2017-05                    4              36
                         2017-06                    4              40
2016-07     139          2016-07                    1              1
                         2016-08                    6              7
                         2016-09                   4               11
                         2016-10                   8               19 
                         2016-11                   13              32
                         2016-12                   5               37

How can I do this?

Answer 1

I think you need groupby + cumsum:

# group by the first index level
df['TotalCancCust'] = df.groupby(level=0)['CanceledCustomers'].cumsum()
# group by the index level named CohortGroup
df['TotalCancCust'] = df.groupby(level='CohortGroup')['CanceledCustomers'].cumsum()

# in pandas 0.20.0+ the level keyword can be omitted
df['TotalCancCust'] = df.groupby('CohortGroup')['CanceledCustomers'].cumsum()

print(df)
                                       CanceledCustomers  TotalCancCust
CohortGroup NewCustomers CancelPeriod                                  
2016-05     75           2016-07                       2              2
                         2016-08                       5              7
                         2016-09                       6             13
                         2016-10                       7             20
                         2016-11                       6             26
                         2016-12                       2             28
                         2017-01                       5             33
                         2017-02                       6             39
                         2017-03                       1             40
                         2017-04                       5             45
                         2017-05                       6             51
                         2017-06                       1             52
2016-06     81           2016-07                       1              1
                         2016-08                       3              4
                         2016-09                       4              8
                         2016-10                       1              9
                         2016-11                       6             15
                         2016-12                       2             17
                         2017-01                       5             22
                         2017-02                       3             25
                         2017-03                       3             28
                         2017-04                       4             32
                         2017-05                       4             36
                         2017-06                       4             40
2016-07     139          2016-07                       1              1
                         2016-08                       6              7
                         2016-09                       4             11
                         2016-10                       8             19
                         2016-11                      13             32
                         2016-12                       5             37
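
For anyone who wants to try this without the original data, here is a minimal sketch. The frame is a hypothetical six-row subset of the question's table (two cohorts, three periods each), built by hand with a three-level MultiIndex; the helper name `totals` is only there for checking the result.

```python
import pandas as pd

# Hypothetical subset of the question's data: two cohorts, three periods each.
index = pd.MultiIndex.from_tuples(
    [
        ("2016-05", 75, "2016-07"),
        ("2016-05", 75, "2016-08"),
        ("2016-05", 75, "2016-09"),
        ("2016-06", 81, "2016-07"),
        ("2016-06", 81, "2016-08"),
        ("2016-06", 81, "2016-09"),
    ],
    names=["CohortGroup", "NewCustomers", "CancelPeriod"],
)
df = pd.DataFrame({"CanceledCustomers": [2, 5, 6, 1, 3, 4]}, index=index)

# cumsum runs within each CohortGroup, so the running total
# restarts at the first row of every cohort.
df["TotalCancCust"] = (
    df.groupby(level="CohortGroup")["CanceledCustomers"].cumsum()
)

totals = df["TotalCancCust"].tolist()
print(df)
```

Note that the result depends on row order: cumsum accumulates in the order the rows appear, so the periods should already be sorted within each cohort.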

Answer 2

First forward-fill your DataFrame, then do the groupby:

# forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas)
df = df.ffill()
df['TotalCancCust'] = df.groupby(['CohortGroup'])['CanceledCustomers'].cumsum()
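
Forward-filling only matters when the cohort labels are real missing values in ordinary columns, for example after reading a report where the label appears once per block. In a MultiIndex display like the one in the question, the blank cells are usually just how pandas prints repeated index values. The sketch below assumes the flat-column case; the frame `flat` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical flat layout: the cohort label is present only on the
# first row of each block and missing (NaN) on the rows below it.
flat = pd.DataFrame(
    {
        "CohortGroup": ["2016-05", np.nan, np.nan, "2016-06", np.nan, np.nan],
        "NewCustomers": [75, np.nan, np.nan, 81, np.nan, np.nan],
        "CancelPeriod": ["2016-07", "2016-08", "2016-09"] * 2,
        "CanceledCustomers": [2, 5, 6, 1, 3, 4],
    }
)

# Fill only the label columns, so genuine gaps in the data are left alone.
label_cols = ["CohortGroup", "NewCustomers"]
flat[label_cols] = flat[label_cols].ffill()

# With every row labelled, the per-cohort running total works as before.
flat["TotalCancCust"] = flat.groupby("CohortGroup")["CanceledCustomers"].cumsum()

flat_totals = flat["TotalCancCust"].tolist()
print(flat)
```

Without the fill, rows whose group key is missing would not be assigned to any cohort under the default groupby settings, so their running totals would not be computed.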
