Unable to call a function inside groupby.agg

Time: 2019-08-29 22:05:18

Tags: python pandas-groupby

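I'm new to Python, so please forgive any mistakes. I'm writing a script that groups a pandas DataFrame using groupby.agg. I get an error when I try to call a function that takes the output of a lambda function as its input.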

Here is a sample of the merged DataFrame:

cprdf.iloc[5:10,5:20]

Out[237]: 

   Loan Nbr  Servicer Loan Nbr Recon  Action Code  Loan Count_x  \

5  21522594           25701889     Y         0.00             1   
6  21522594           25701889     Y         0.00             1   
7  21522594           25701889     Y         0.00             1   
8  21522594           25701889     Y         0.00             1   
9  21522594           25701889     Y         0.00             1   

   Days Delinquent_x Sale Date_x  UPB Beginning  UPB Purchase  UPB Sch Prin  \

5               0.00         NaN     142,936.57          0.00        162.16   
6               0.00         NaN     143,097.92          0.00        161.35   
7               0.00         NaN     143,258.47          0.00        160.55   
8               0.00         NaN     143,418.22          0.00        159.75   
9               0.00         NaN     143,735.33          0.00        317.11   

   UPB Curtailment  UPB Liq  UPB Adjustment  UPB Non Cash  UPB Ending  
5             0.00     0.00            0.00          0.00  142,774.41  
6             0.00     0.00            0.00          0.00  142,936.57  
7             0.00     0.00            0.00          0.00  143,097.92  
8             0.00     0.00            0.00          0.00  143,258.47  
9             0.00     0.00            0.00          0.00  143,418.22  

What I want to do is implement the following formulas for various groupby operations:

SMM = (UPB Curtailment + UPB Liq + UPB Adjustment) / (UPB Beginning)

CPR (%) = 100 * (1 - (1 - SMM)^12)
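As a quick sanity check on the two formulas, here is a minimal plain-Python sketch. The helper names `smm` and `cpr_pct` are hypothetical and not part of the script; the input 0.063944 is the weighted SMM reported for 2019-04 in the output further down.

```python
def smm(curtailment, liquidation, adjustment, upb_beginning):
    """Unscheduled principal divided by beginning balance (0 if the balance is 0)."""
    if upb_beginning == 0:
        return 0.0
    return (curtailment + liquidation + adjustment) / upb_beginning


def cpr_pct(smm_value):
    """Annualize a monthly SMM into CPR, expressed as a percentage."""
    return 100 * (1 - (1 - smm_value) ** 12)


# SMM taken from the 2019-04 row of the grouped output
example_cpr = cpr_pct(0.063944)
print(round(example_cpr, 2))  # about 54.75, in line with the CPR shown for 2019-04
```

Both helpers work on plain numbers; inside groupby.agg the same arithmetic has to be wrapped in a callable that receives each group's Series.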

Here is the relevant code:


cprdf['NonSchP'] = cprdf['UPB Curtailment'] + cprdf['UPB Liq'] + \
                    cprdf['UPB Adjustment']


cprdf['SMM'] = np.where(cprdf['UPB Beginning'] == 0, 0,
                        cprdf['NonSchP']/cprdf['UPB Beginning'])



def wtavg(x):  
    return lambda x: np.average(x, weights=cprdf.loc[x.index, 'UPB Beginning'])


def cpr(y):
       z = 100 * (1 - np.power((1 - y), 12))
       return z

# dictionary for new columns

n = {'UPB_sum' : pd.NamedAgg('UPB Beginning', 'sum'),
     'UPB_count': pd.NamedAgg('UPB Beginning', 'count'),
     'PIF_sum': pd.NamedAgg('UPB Liq', 'sum'),
     'PIF_count' : pd.NamedAgg('UPB Liq', np.count_nonzero),
     'SMMAgg' : pd.NamedAgg('SMM', wtavg(cprdf['SMM'])),
     'Rate': pd.NamedAgg('Current Loan Rate',wtavg(cprdf['Current Loan Rate'])),   
     'CPR':pd.NamedAgg('SMM',cpr(wtavg(cprdf['SMM'])))}

cprgroup = cprdf.groupby(['month_year'],as_index=True).agg(**n)

cprgroup.reset_index(drop=False,inplace=True)   

I expect the output to be:

cprgroup

Out[240]:

  month_year        UPB_sum  UPB_count      PIF_sum  PIF_count  SMM  Rate  \

0    2019-04  11,237,040.94         22   718,172.19       1.00 0.06  5.95   
1    2019-05  16,684,325.75         31         0.00       0.00 0.00  5.99   
2    2019-06 106,783,721.43        221 2,242,731.83       3.00 0.02  5.77   
3    2019-07 104,181,644.18        218 1,035,861.72       3.00 0.01  5.77   
4    2019-08 102,853,211.42        215 3,188,568.04       2.00 0.03  5.77   

    CPR  
0 54.75  
1  0.03  
2 24.07  
3 13.24  
4 31.70 

Instead, when I run the program, I get the following error:

runfile('C:/Users/spyder-py3/untitled3.py', wdir='C:/Users/.spyder-py3')
Traceback (most recent call last):

  File "<ipython-input-241-c3f795a9d003>", line 1, in <module>
    runfile('C:/.spyder-py3/untitled3.py', wdir='C:/Users/.spyder-py3')

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/.spyder-py3/untitled3.py", line 51, in <module>
    'CPR':pd.NamedAgg('SMM',cpr(wtavg(cprdf['SMM'])))}

  File "C:/Users/.spyder-py3/untitled3.py", line 39, in cpr
    z = 100 * (1 - np.power((1 - y), 12))

TypeError: unsupported operand type(s) for -: 'int' and 'function'

Am I passing the lambda function incorrectly as input to the cpr function?
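The first error can be reproduced without pandas. The sketch below assumes the cause is what the traceback suggests: cpr runs once, while the dictionary is being built, and receives the function returned by wtavg rather than a number. Here wtavg is simplified (the weights are dropped) and the sample values are made up.

```python
import numpy as np


def wtavg(x):
    # Simplified stand-in: ignores its argument and returns a new function,
    # just like the original wtavg does
    return lambda x: np.average(x)


def cpr(y):
    return 100 * (1 - np.power((1 - y), 12))


returned = wtavg([0.01, 0.02])
print(callable(returned))  # True: wtavg hands back a function, not a number

try:
    cpr(returned)  # evaluates 1 - <function>, which Python cannot do
except TypeError as exc:
    message = str(exc)
    print(message)
```

The aggregation function handed to NamedAgg has to be a callable that pandas invokes later on each group, so the CPR arithmetic needs to live inside that callable instead of being applied to it up front.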

When I change the dictionary 'n' to use 'SMMAgg' as the input to the function:

'CPR':pd.NamedAgg('SMMAgg',cpr(SMMAgg))

I get

NameError: name 'SMMAgg' is not defined

When I change the formula to

'CPR':pd.NamedAgg('SMMAgg',cpr('SMMAgg'))

I get

File "C:/Users/.spyder-py3/untitled3.py", line 39, in cpr
z = 100 * (1 - np.power((1 - y), 12))

TypeError: unsupported operand type(s) for -: 'int' and 'str'

Any help would be appreciated.

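I got around the error by adding the CPR calculation as a new column on the grouped DataFrame after aggregation, and was able to get the desired output. But there is something I don't understand about calling this function inside the dictionary.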
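A minimal sketch of that workaround on toy data. The frame `toy`, its values, and the helper `weighted_smm` are made up for illustration; only the column names come from the script above.

```python
import numpy as np
import pandas as pd

# Toy data; the numbers are invented for illustration only
toy = pd.DataFrame({
    "month_year": ["2019-04", "2019-04", "2019-05", "2019-05"],
    "UPB Beginning": [100.0, 300.0, 200.0, 200.0],
    "SMM": [0.04, 0.00, 0.01, 0.03],
})


def weighted_smm(s):
    # s is one group's SMM Series; look up the matching weights by index
    return np.average(s, weights=toy.loc[s.index, "UPB Beginning"])


grouped = toy.groupby("month_year").agg(
    UPB_sum=pd.NamedAgg(column="UPB Beginning", aggfunc="sum"),
    SMMAgg=pd.NamedAgg(column="SMM", aggfunc=weighted_smm),
)

# CPR is computed after aggregation, as an ordinary column operation
grouped["CPR"] = 100 * (1 - (1 - grouped["SMMAgg"]) ** 12)
print(grouped)
```

Because CPR is a plain transformation of the already aggregated SMM, computing it afterwards avoids running the weighted average a second time.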

Thanks.

1 Answer:

Answer 0: (score: 1)

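After some research, I found a solution. One issue I noticed (not 100% sure) is that NamedAgg does not accept the same column for multiple custom aggregation functions, so I created a dummy SMM column. I modified the cpr function to return a lambda directly, instead of assigning the result to a new variable and returning that. I also call wtavg inside cpr and pass it the array being aggregated as input. So: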

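# Dummy copy of SMM so the two custom aggregations read from different columns
cprdf['SMM1'] = cprdf['SMM']

def wtavg():
    # Returns an aggregator: group mean weighted by 'UPB Beginning'
    return lambda x: np.average(x, weights=cprdf.loc[x.index, 'UPB Beginning'])

def cpr():
    # Returns an aggregator: annualizes the weighted-average SMM into CPR (%)
    return lambda y: 100 * (1 - np.power((1 - wtavg()(y)), 12))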

My kwargs dictionary then looks like this:

n = {'UPB_sum' : pd.NamedAgg('UPB Beginning', 'sum'),
     'UPB_count': pd.NamedAgg('UPB Beginning', 'count'),
     'PIF_sum': pd.NamedAgg('UPB Liq', 'sum'),
     'PIF_count' : pd.NamedAgg('UPB Liq', np.count_nonzero),
     'SMMAgg' : pd.NamedAgg('SMM', wtavg()),
     'Rate': pd.NamedAgg('Current Loan Rate',wtavg()),   
     'CPRAgg':pd.NamedAgg('SMM1',cpr())} 
cprgroup=cprdf.groupby(['month_year'],as_index=True).agg(**n)
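The same pattern can be tried end to end on toy data. This is only a sketch under assumptions: `make_wtavg`, `make_cpr`, and the frame `toy` are hypothetical names, the numbers are made up, and the factories take the frame and weight column as arguments instead of reading the global cprdf. The dummy SMM1 column is kept to mirror the code above; whether it is actually required may depend on the pandas version.

```python
import numpy as np
import pandas as pd


def make_wtavg(frame, weight_col):
    """Build an aggregator returning the group's mean weighted by weight_col."""
    def _wtavg(s):
        return np.average(s, weights=frame.loc[s.index, weight_col])
    return _wtavg


def make_cpr(frame, weight_col):
    """Build an aggregator returning CPR (%) from the group's weighted SMM."""
    wt = make_wtavg(frame, weight_col)

    def _cpr(s):
        return 100 * (1 - np.power(1 - wt(s), 12))
    return _cpr


# Toy data; the numbers are invented for illustration only
toy = pd.DataFrame({
    "month_year": ["2019-04", "2019-04", "2019-05", "2019-05"],
    "UPB Beginning": [100.0, 300.0, 200.0, 200.0],
    "SMM": [0.04, 0.00, 0.01, 0.03],
})
toy["SMM1"] = toy["SMM"]  # dummy copy, mirroring the approach above

result = toy.groupby("month_year").agg(
    SMMAgg=pd.NamedAgg(column="SMM", aggfunc=make_wtavg(toy, "UPB Beginning")),
    CPRAgg=pd.NamedAgg(column="SMM1", aggfunc=make_cpr(toy, "UPB Beginning")),
)
print(result)
```

Each factory is called once when the aggregation spec is built, and pandas then calls the returned function once per group, which is the step the original cpr(wtavg(...)) call skipped.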

Output:

cprgroup
Out[51]: 
  month_year       UPB_sum  UPB_count     PIF_sum  PIF_count    SMMAgg  \
0    2019-04  1.123704e+07         22   718172.19        1.0  0.063944   
1    2019-05  1.668433e+07         31        0.00        0.0  0.000025   
2    2019-06  1.067837e+08        221  2242731.83        3.0  0.022690   
3    2019-07  1.041816e+08        218  1035861.72        3.0  0.011770   
4    2019-08  1.028532e+08        215  3188568.04        2.0  0.031268   

       Rate     CPRAgg  
0  5.946053  54.749920  
1  5.987882   0.030278  
2  5.774863  24.074820  
3  5.772602  13.244130  
4  5.771342  31.696039  

Voilà!