Question

I have a dataset in which a single project code can have multiple subscriptions, like the one created below:

import pandas as pd

data = {'Project Code': [1622, 1622,1622,1622,1622,1622,1622,1622],
'Subscription Line': [1,2,1,2,1,2,1,1],
'Date': ['4/1/2020', '4/1/2020', '5/1/2020', '5/1/2020', '6/1/2020', '6/1/2020', '7/1/2020', '8/1/2020'],
'Subscription Spend': [ 293, 195, 31, 200, 0, 0, 3270,184],
'Projected Subscription Spend': [11758, 8970, 12261, 6807, 9963, 5480, 11885, 9900],
'Project-Month':['1622April2020', ' 1622April2020', '1622May2020', '1622May2020', '1622June2020', '1622June2020', '1622July2020', '1622August2020']
}
df = pd.DataFrame(data, columns=['Project Code', 'Date', 'Subscription Spend', 'Projected Subscription Spend', 'Project-Month'])

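I want to compute a column that gives the projected spend at the project level, as the sum of 'Projected Subscription Spend'. So for April 2020 the projected project spend would be 11,758 + 8,970 = 20,728, and that value would appear on both rows. The projected project spend would therefore look like this: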

'Projected Project Spend' = [20728, 20728, 19068, 19068, 15443, 15443, 11885, 9900]

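I tried to do this with groupby and sum, but when I run the code I get blanks in 'Projected Project Spend'. When I use cumsum, however, I get values that behave the way you would expect cumsum to: they accumulate over time. The two lines of code I tried are: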

df['Projected Project Spend'] = (df['Subscription Spend']).groupby(df['Subscription Code']).sum()
df['Projected Project Spend'] = (df['Projected Subscription Spend']).groupby(df['Project-Month']).cumsum()

Why is the output of sum blank when the output of cumsum is not? How can I get the sum?

Answer 1

Try this code:

import pandas as pd

data = {'Project Code': [1622, 1622,1622,1622,1622,1622,1622,1622],
'Subscription Line': [1,2,1,2,1,2,1,1],
'Date': ['4/1/2020', '4/1/2020', '5/1/2020', '5/1/2020', '6/1/2020', '6/1/2020', '7/1/2020', '8/1/2020'],
'Subscription Spend': [ 293, 195, 31, 200, 0, 0, 3270,184],
'Projected Subscription Spend': [11758, 8970, 12261, 6807, 9963, 5480, 11885, 9900],
'Project-Month':['1622April2020', ' 1622April2020', '1622May2020', '1622May2020', '1622June2020', '1622June2020', '1622July2020', '1622August2020']
}

df = pd.DataFrame(data, columns=['Project Code', 'Date', 'Subscription Spend', 'Projected Subscription Spend', 'Project-Month'])

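df['month'] = df['Project-Month'].str[4:-4]  # create a new column holding the month name
df.iloc[1, -1] = 'April'  # the second row read '2April' (leading space in the source), so correct it
# select the numeric column before summing so the string columns are not aggregated
df.groupby('month')['Projected Subscription Spend'].sum()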

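The groupby on 'Project-Month' probably does not work as expected because the second row is malformed (it has a leading space); I have corrected that.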
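
As an alternative to patching the one row by position, the key column can be cleaned before grouping. The following is only a minimal sketch on a cut-down stand-in frame (the four-row `df` and the `groups_before` / `groups_after` names are illustrative, not from the original post); it assumes the only problem with the labels is surrounding whitespace.

```python
import pandas as pd

# Small stand-in frame; the second key carries a stray leading space,
# as in the question's data.
df = pd.DataFrame({
    'Project-Month': ['1622April2020', ' 1622April2020', '1622May2020', '1622May2020'],
    'Projected Subscription Spend': [11758, 8970, 12261, 6807],
})

# Before cleaning, the two April rows count as different labels.
groups_before = df['Project-Month'].nunique()

# Strip surrounding whitespace so identical labels compare equal.
df['Project-Month'] = df['Project-Month'].str.strip()
groups_after = df['Project-Month'].nunique()

# One total per cleaned label.
totals = df.groupby('Project-Month')['Projected Subscription Spend'].sum()
print(groups_before, groups_after)
print(totals)
```

With the labels stripped, both April rows land in the same group, so no row has to be corrected by hand.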

Answer 2

Similar to Chris's comment, but using 'sum' for better performance:

df['Total_Spend'] = (df.groupby(['Project Code', 'Date'])
                        ['Projected Subscription Spend'].transform('sum')
                    )

Output:

      Project Code  Date        Subscription Spend    Projected Subscription Spend  Project-Month      Total_Spend
--  --------------  --------  --------------------  ------------------------------  ---------------  -------------
 0            1622  4/1/2020                   293                           11758  1622April2020            20728
 1            1622  4/1/2020                   195                            8970  1622April2020            20728
 2            1622  5/1/2020                    31                           12261  1622May2020              19068
 3            1622  5/1/2020                   200                            6807  1622May2020              19068
 4            1622  6/1/2020                     0                            9963  1622June2020             15443
 5            1622  6/1/2020                     0                            5480  1622June2020             15443
 6            1622  7/1/2020                  3270                           11885  1622July2020             11885
 7            1622  8/1/2020                   184                            9900  1622August2020            9900
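
As for why the original `sum` attempt produced blanks while `cumsum` did not: the difference comes down to the index of the result. The sketch below uses a cut-down stand-in frame (the column names `from_sum`, `from_cumsum` and `from_transform` are illustrative, not from the original post) to show how each result aligns when it is assigned back as a column.

```python
import pandas as pd

df = pd.DataFrame({
    'Project-Month': ['1622April2020', '1622April2020', '1622May2020', '1622May2020'],
    'Projected Subscription Spend': [11758, 8970, 12261, 6807],
})
grouped = df.groupby('Project-Month')['Projected Subscription Spend']

# sum() returns one row per group, indexed by the group label.
per_group = grouped.sum()
# cumsum() and transform('sum') return one row per original row,
# indexed like df (0, 1, 2, 3).
running = grouped.cumsum()
broadcast = grouped.transform('sum')

# Column assignment aligns on the index: labels such as '1622April2020'
# match none of the row labels 0..3, so every value becomes NaN.
df['from_sum'] = per_group
df['from_cumsum'] = running
df['from_transform'] = broadcast
print(df)
```

Because `transform('sum')` keeps the original row index, it is the variant that can be assigned straight back to the frame, which is what the answer above relies on.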

How does pandas groupby cumsum differ from groupby sum?
