Python scipy.optimize: how to run multiple univariate constrained regressions by group

Asked: 2017-07-03 14:08:24

Tags: python scipy

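My question is similar to (Python pandas: how to run multiple univariate regression by group). I have a set of regressions to run by group, but in my case each regression coefficient is bounded between 0 and 1, and there is a constraint that the coefficients must sum to 1. I tried to solve this as an optimization problem, first using the whole dataframe (ignoring the groups).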

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'y0': np.random.randn(20),
    'y1': np.random.randn(20),
    'x0': np.random.randn(20), 
    'x1': np.random.randn(20),
    'grpVar': ['a', 'b'] * 10})

def SumSqDif(a):
     return np.sum((df['y0'] - a[0]*df['x0'])**2 + (df['y1'] - a[1]*df['x1'])**2 )

# Starting values (0.5 rather than 1/2, which evaluates to 0 under Python 2)
startVal = np.ones(2) * 0.5

# Constraint: sum of coefficients = 1
cons = ({'type':'eq', 'fun': lambda x: 1 - sum(x)})

# Bounds on coefficients: each must lie in [0, 1]
bnds = tuple((0, 1) for x in startVal)

# Solve the optimization problem using the full dataframe (disregarding groups)
from scipy.optimize import minimize
Result = minimize(SumSqDif, startVal , method='SLSQP' , bounds=bnds , constraints = cons )
Result.x

I then tried to use the dataframe's groupby and apply(), but I get this error:

TypeError: unhashable type: 'numpy.ndarray'

# Try to Solve the optimization problem By group
# Create GroupBy object
grp_grpVar = df.groupby('grpVar')

def RunMinimize(data):
    ResultByGrp = minimize(SumSqDif, startVal , method='SLSQP' , bounds=bnds , constraints = cons )
    return ResultByGrp.x

grp_grpVar.apply(RunMinimize(df))

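This could probably be done with a loop, but my actual data contains about 70 million groups, and I thought dataframe groupby and apply() would be more efficient. I am new to Python. I searched this site and others but could not find any example of using dataframe apply() with scipy.optimize.minimize. Any ideas would be greatly appreciated.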

1 Answer:

Answer 0 (score: 0)

I believe what you want is:

# Add a df parameter to the `SumSqDif` signature so that, when the
# function is applied to the grouped dataframe, each group is passed
# in as the df argument
def SumSqDif(a, df):
    return np.sum((df['y0'] - a[0]*df['x0'])**2 + (df['y1'] - a[1]*df['x1'])**2)

# add startVal, bnds, and cons as additional parameters 
# The way you wrote your function signature is that it
# uses these values from the global namespace, which is not good practice,
# because you're assuming these values exist in the global scope,
# which may not always be true
def RunMinimize(data, startVal, bnds, cons):
    # add additional argument of data into the minimize function
    # this passes the group as the df to SumSqDif
    ResultByGrp = minimize(SumSqDif, startVal, method='SLSQP',
                           bounds=bnds, constraints=cons, args=(data,))
    return ResultByGrp.x

# Here, startVal, bnds, and cons are passed to `apply` as additional
# keyword arguments, which `apply` forwards to RunMinimize
df.groupby('grpVar').apply(RunMinimize, startVal=startVal, bnds=bnds, cons=cons)
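
For reference, here is a self-contained sketch of the same idea that can be run end to end. It is only one way to arrange it, under a few assumptions of mine: the snake_case names (`sum_sq_dif`, `run_minimize`), the fixed random seed, and the output labels `a0` and `a1` are not from the question. Returning a `pd.Series` from the applied function is a choice that makes `apply` produce one row of coefficients per group.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize


def sum_sq_dif(a, data):
    """Sum of squared residuals of the two univariate fits for one group."""
    return np.sum((data['y0'] - a[0] * data['x0']) ** 2
                  + (data['y1'] - a[1] * data['x1']) ** 2)


def run_minimize(data, start_val, bnds, cons):
    """Fit the two bounded coefficients for a single group."""
    # args must be a tuple; its items are passed to sum_sq_dif after `a`
    result = minimize(sum_sq_dif, start_val, args=(data,), method='SLSQP',
                      bounds=bnds, constraints=cons)
    # Labels 'a0' and 'a1' are illustrative, not from the question
    return pd.Series(result.x, index=['a0', 'a1'])


# Toy data shaped like the question's example (seed chosen arbitrarily)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'y0': rng.standard_normal(20),
    'y1': rng.standard_normal(20),
    'x0': rng.standard_normal(20),
    'x1': rng.standard_normal(20),
    'grpVar': ['a', 'b'] * 10})

start_val = np.full(2, 0.5)                             # starting values
bnds = [(0, 1), (0, 1)]                                 # 0 <= coefficient <= 1
cons = {'type': 'eq', 'fun': lambda a: 1 - np.sum(a)}   # coefficients sum to 1

# Pass the function itself (not the result of calling it) plus keyword arguments
coefs = (df.groupby('grpVar')[['y0', 'y1', 'x0', 'x1']]
           .apply(run_minimize, start_val=start_val, bnds=bnds, cons=cons))
print(coefs)
```

The original attempt failed because `grp_grpVar.apply(RunMinimize(df))` calls `RunMinimize` immediately and hands the resulting array to `apply`, which expects a callable. Note also that `apply` still runs `minimize` once per group in Python, so with tens of millions of groups it may not be much faster than an explicit loop.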