How does pandas groupby cumsum differ from groupby sum?

Asked: 2020-10-05 16:56:17

Tags: python pandas group-by

I have a dataset in which a single project code can have multiple subscriptions, similar to the one created below:

import pandas as pd

data = {'Project Code': [1622, 1622, 1622, 1622, 1622, 1622, 1622, 1622],
'Subscription Line': [1, 2, 1, 2, 1, 2, 1, 1],
'Date': ['4/1/2020', '4/1/2020', '5/1/2020', '5/1/2020', '6/1/2020', '6/1/2020', '7/1/2020', '8/1/2020'],
'Subscription Spend': [293, 195, 31, 200, 0, 0, 3270, 184],
'Projected Subscription Spend': [11758, 8970, 12261, 6807, 9963, 5480, 11885, 9900],
'Project-Month': ['1622April2020', ' 1622April2020', '1622May2020', '1622May2020', '1622June2020', '1622June2020', '1622July2020', '1622August2020']
}
df = pd.DataFrame(data, columns=['Project Code', 'Date', 'Subscription Spend', 'Projected Subscription Spend', 'Project-Month'])

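I want to compute a column that gives the projected spend at the project level, as the sum of 'Projected Subscription Spend'. So for April 2020 the projected project spend would be 11,758 + 8,970 = 20,728, and that value would appear on both rows. The projected project spend would therefore look like this: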

'Projected Project Spend' = [20728, 20728, 19068, 19068, 15443, 15443, 11885, 9900]

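I tried to do this with groupby and sum, but when I run the code I get blanks in 'Projected Project Spend'. When I use cumsum, however, I get values that behave the way you would expect cumsum to behave: they accumulate over time. The two lines of code I tried are: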

df['Projected Project Spend'] = (df['Subscription Spend']).groupby(df['Subscription Code']).sum()
df['Projected Project Spend'] = (df['Projected Subscription Spend']).groupby(df['Project-Month']).cumsum()

Why is the output of sum blank when the output of cumsum is not? And how can I get the sum?

2 Answers:

Answer 0: (score: 0)

Try this code:

import pandas as pd

data = {'Project Code': [1622, 1622, 1622, 1622, 1622, 1622, 1622, 1622],
'Subscription Line': [1, 2, 1, 2, 1, 2, 1, 1],
'Date': ['4/1/2020', '4/1/2020', '5/1/2020', '5/1/2020', '6/1/2020', '6/1/2020', '7/1/2020', '8/1/2020'],
'Subscription Spend': [293, 195, 31, 200, 0, 0, 3270, 184],
'Projected Subscription Spend': [11758, 8970, 12261, 6807, 9963, 5480, 11885, 9900],
'Project-Month': ['1622April2020', ' 1622April2020', '1622May2020', '1622May2020', '1622June2020', '1622June2020', '1622July2020', '1622August2020']
}

df = pd.DataFrame(data, columns=['Project Code', 'Date', 'Subscription Spend', 'Projected Subscription Spend', 'Project-Month'])

df['month'] = df['Project-Month'].str[4:-4]  # create a new column for the month name
df.iloc[1, -1] = 'April'  # the second row read '2April' because of a leading space, so correct it
df.groupby('month')['Projected Subscription Spend'].sum()  # one total per month

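The groupby on Project-Month probably did not work properly because the second row is formatted incorrectly (it has a leading space); I have corrected that.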
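
As a rough sketch of an alternative to patching that one row by hand, the stray whitespace could be stripped with `str.strip()` before grouping. This assumes the only problem in `Project-Month` is surrounding whitespace; the names `n_groups_raw`, `n_groups_clean` and `monthly` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Projected Subscription Spend': [11758, 8970, 12261, 6807, 9963, 5480, 11885, 9900],
    'Project-Month': ['1622April2020', ' 1622April2020', '1622May2020', '1622May2020',
                      '1622June2020', '1622June2020', '1622July2020', '1622August2020'],
})

# Before cleaning, ' 1622April2020' and '1622April2020' count as different groups.
n_groups_raw = df['Project-Month'].nunique()

# Strip surrounding whitespace so both April rows share one key.
df['Project-Month'] = df['Project-Month'].str.strip()
n_groups_clean = df['Project-Month'].nunique()

# One total per Project-Month, kept in order of first appearance.
monthly = df.groupby('Project-Month', sort=False)['Projected Subscription Spend'].sum()

print(n_groups_raw, n_groups_clean)
print(monthly)
```

With the key cleaned, the April total comes out as 20728, matching the figure in the question.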

Answer 1: (score: 0)

Similar to Chris's comment, but using 'sum' for a better result:

df['Total_Spend'] = (df.groupby(['Project Code', 'Date'])
                        ['Projected Subscription Spend'].transform('sum')
                    )

Output:

      Project Code  Date        Subscription Spend    Projected Subscription Spend  Project-Month      Total_Spend
--  --------------  --------  --------------------  ------------------------------  ---------------  -------------
 0            1622  4/1/2020                   293                           11758  1622April2020            20728
 1            1622  4/1/2020                   195                            8970  1622April2020            20728
 2            1622  5/1/2020                    31                           12261  1622May2020              19068
 3            1622  5/1/2020                   200                            6807  1622May2020              19068
 4            1622  6/1/2020                     0                            9963  1622June2020             15443
 5            1622  6/1/2020                     0                            5480  1622June2020             15443
 6            1622  7/1/2020                  3270                           11885  1622July2020             11885
 7            1622  8/1/2020                   184                            9900  1622August2020            9900
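
On the "why" part of the question, the likely explanation is index alignment: `groupby(...).sum()` returns one row per group, indexed by the group keys, while `cumsum()` and `transform('sum')` return one row per original row with the original index. Assigning a Series to a DataFrame column aligns on index labels, so the aggregated result finds no matching labels and the column fills with NaN. A minimal sketch, using hypothetical toy data and column names rather than the asker's frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Project-Month': ['A', 'A', 'B', 'B', 'C'],  # hypothetical group keys
    'Spend': [10, 20, 30, 40, 50],               # hypothetical values
})

grouped = df['Spend'].groupby(df['Project-Month'])

# sum() aggregates: one row per group, indexed by 'A', 'B', 'C'.
agg = grouped.sum()

# cumsum() and transform('sum') keep the original shape and index (0..4).
running = grouped.cumsum()
broadcast = grouped.transform('sum')

# Column assignment aligns on index labels; 'A'/'B'/'C' never match 0..4,
# so every cell of via_sum ends up NaN.
df['via_sum'] = agg
df['via_cumsum'] = running
df['via_transform'] = broadcast

print(df)
```

Here `via_sum` is entirely NaN, `via_cumsum` accumulates within each group, and `via_transform` repeats each group total on every row of that group, which is the behaviour the question asks for.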