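My question is similar to (Python pandas: how to run multiple univariate regression by group). I have a set of regressions to run by group, but in my case the regression coefficients are bounded between 0 and 1, and there is a constraint that the coefficients must sum to 1. I tried to solve this as an optimization problem, first using the whole dataframe (ignoring groups).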
import pandas as pd
import numpy as np
df = pd.DataFrame({
'y0': np.random.randn(20),
'y1': np.random.randn(20),
'x0': np.random.randn(20),
'x1': np.random.randn(20),
'grpVar': ['a', 'b'] * 10})
def SumSqDif(a):
    return np.sum((df['y0'] - a[0]*df['x0'])**2 + (df['y1'] - a[1]*df['x1'])**2)
# Starting values
startVal = np.ones(2)*(1/2)
# Constraint: sum of coefficients = 1
cons = ({'type':'eq', 'fun': lambda x: 1 - sum(x)})
# Bounds on coefficients
bnds = tuple([0,1] for x in startVal)
# Solve the optimization problem using the full dataframe (disregarding groups)
from scipy.optimize import minimize
Result = minimize(SumSqDif, startVal , method='SLSQP' , bounds=bnds , constraints = cons )
Result.x
Then I tried to do the same with the dataframe groupby and apply(), but I get the error TypeError: unhashable type: 'numpy.ndarray'.
# Try to Solve the optimization problem By group
# Create GroupBy object
grp_grpVar = df.groupby('grpVar')
def RunMinimize(data):
    ResultByGrp = minimize(SumSqDif, startVal , method='SLSQP' , bounds=bnds , constraints = cons )
    return ResultByGrp.x
grp_grpVar.apply(RunMinimize(df))
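This could probably be done by looping over the groups, but my actual data contains about 70 million groups, and I thought dataframe groupby and apply() would be more efficient.
I am new to Python. I searched this site and others but could not find any examples that combine dataframe apply() with scipy.optimize.minimize.
Any ideas would be greatly appreciated.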
Answer 0 (score: 0)
I believe what you want is:
# add a df parameter to your `SumSqDif` function signature, so that when you apply
# this function to your grouped dataframe, each group gets passed
# as the df argument to this function
def SumSqDif(a, df):
    return np.sum((df['y0'] - a[0]*df['x0'])**2 + (df['y1'] - a[1]*df['x1'])**2)
# add startVal, bnds, and cons as additional parameters
# The way you wrote your function signature is that it
# uses these values from the global namespace, which is not good practice,
# because you're assuming these values exist in the global scope,
# which may not always be true
def RunMinimize(data, startVal, bnds, cons):
    # pass the group through `args` (a one-element tuple) so that
    # minimize forwards it as the df argument of SumSqDif
    ResultByGrp = minimize(SumSqDif, startVal, method='SLSQP',
                           bounds=bnds, constraints=cons, args=(data,))
    return ResultByGrp.x
# Here, you're passing startVal, bnds, and cons as
# additional keyword arguments to `apply`
df.groupby('grpVar').apply(RunMinimize, startVal=startVal, bnds=bnds, cons=cons)
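For completeness, here is a self-contained sketch of the whole thing that can be pasted and run as-is. A few things in it are my own assumptions rather than part of the question: the seeded random generator, the output labels `a0` and `a1`, returning a `pd.Series` so the result comes back as one row per group, and selecting the numeric columns before `apply`. Note also that in the question, `apply(RunMinimize(df))` calls `RunMinimize` first and hands its ndarray result to `apply`, which expects a callable; that is most likely where the TypeError came from.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

# Toy data shaped like the question's; the seed is an arbitrary choice
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'y0': rng.standard_normal(20),
    'y1': rng.standard_normal(20),
    'x0': rng.standard_normal(20),
    'x1': rng.standard_normal(20),
    'grpVar': ['a', 'b'] * 10,
})

def SumSqDif(a, data):
    # Sum of squared residuals for the two univariate fits on one group
    return np.sum((data['y0'] - a[0]*data['x0'])**2
                  + (data['y1'] - a[1]*data['x1'])**2)

def RunMinimize(data, startVal, bnds, cons):
    # args=(data,) forwards the current group to SumSqDif
    res = minimize(SumSqDif, startVal, args=(data,), method='SLSQP',
                   bounds=bnds, constraints=cons)
    # 'a0'/'a1' are illustrative labels, not names from the question
    return pd.Series(res.x, index=['a0', 'a1'])

startVal = np.full(2, 0.5)
bnds = [(0, 1), (0, 1)]                                  # 0 <= a_i <= 1
cons = {'type': 'eq', 'fun': lambda a: 1 - np.sum(a)}    # a_0 + a_1 = 1

# Pass the function itself (not its result) to apply; extra keyword
# arguments are forwarded to RunMinimize for every group
coefs = (df.groupby('grpVar')[['y0', 'y1', 'x0', 'x1']]
           .apply(RunMinimize, startVal=startVal, bnds=bnds, cons=cons))
print(coefs)
```

Keep in mind that `apply` still calls `minimize` once per group, so with around 70 million groups this is tidier than a hand-written loop but probably not dramatically faster.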