Question

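My question is similar to "Python pandas: how to run multiple univariate regression by group". I have a set of regressions to run by group, but in my case the regression coefficients are bounded between 0 and 1, and there is a constraint that the coefficients must sum to 1. I am trying to solve this as an optimization problem, first using the whole dataframe (ignoring the groups).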

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'y0': np.random.randn(20),
    'y1': np.random.randn(20),
    'x0': np.random.randn(20), 
    'x1': np.random.randn(20),
    'grpVar': ['a', 'b'] * 10})

def SumSqDif(a):
     return np.sum((df['y0'] - a[0]*df['x0'])**2 + (df['y1'] - a[1]*df['x1'])**2 )

# Starting values
startVal = np.ones(2)*(1/2)

# Constraint: sum of coefficients = 1
cons = ({'type':'eq', 'fun': lambda x: 1 - sum(x)})

# Bounds on coefficients
bnds = tuple([0,1] for x in startVal)

# Solve the optimization problem using the full dataframe (disregarding groups)
from scipy.optimize import minimize
Result = minimize(SumSqDif, startVal , method='SLSQP' , bounds=bnds , constraints = cons )
Result.x

Then I tried to use dataframe groupby and apply(), but I get this error:

TypeError: unhashable type: 'numpy.ndarray'

# Try to Solve the optimization problem By group
# Create GroupBy object
grp_grpVar = df.groupby('grpVar')

def RunMinimize(data):
    ResultByGrp = minimize(SumSqDif, startVal , method='SLSQP' , bounds=bnds , constraints = cons )
    return ResultByGrp.x

grp_grpVar.apply(RunMinimize(df))

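This could probably be done by looping over the groups, but my actual data contains about 70 million groups, and I thought dataframe groupby and apply() would be more efficient. I am new to Python. I searched this site and others but could not find any example that combines dataframe apply() with scipy.optimize.minimize. Any ideas would be greatly appreciated.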

Answer 1

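I believe what you want is the code below. Note that in your attempt, `RunMinimize(df)` is called immediately, so apply() receives the resulting array instead of a function, which is the likely source of the TypeError. Pass the function itself and let apply() call it once per group: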

# Add a df parameter to the `SumSqDif` signature so that, when the function
# is applied to the grouped dataframe, each group is passed in as df
def SumSqDif(a, df):
    return np.sum((df['y0'] - a[0]*df['x0'])**2 + (df['y1'] - a[1]*df['x1'])**2)

# Add startVal, bnds, and cons as additional parameters.
# Your original function read these values from the global namespace,
# which is not good practice: it assumes they exist in the global scope,
# which may not always be true
def RunMinimize(data, startVal, bnds, cons):
    # Pass the group through `args` so that minimize forwards it
    # to SumSqDif as the df argument (note the one-element tuple)
    ResultByGrp = minimize(SumSqDif, startVal, method='SLSQP',
                           bounds=bnds, constraints=cons, args=(data,))
    return ResultByGrp.x

# Here, startVal, bnds, and cons are passed to `RunMinimize`
# as additional keyword arguments of `apply`
df.groupby('grpVar').apply(RunMinimize, startVal=startVal, bnds=bnds, cons=cons)
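For reference, here is a self-contained sketch that puts the pieces above together and can be run as is. It is only an illustration under a few assumptions that are not in the original post: the names `sum_sq_dif`, `run_minimize`, `start_val`, and `coefs` are hypothetical, the random data uses a fixed seed, and each group's result is wrapped in a `pd.Series` so the output comes back as one row per group with columns `a0` and `a1`.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize


def sum_sq_dif(a, data):
    """Sum of squared residuals of the two no-intercept regressions."""
    return np.sum((data['y0'] - a[0] * data['x0'])**2
                  + (data['y1'] - a[1] * data['x1'])**2)


def run_minimize(data, start_val, bnds, cons):
    """Solve the bounded, sum-to-one problem for a single group."""
    res = minimize(sum_sq_dif, start_val, args=(data,), method='SLSQP',
                   bounds=bnds, constraints=cons)
    # Returning a Series (an assumption of this sketch) makes apply()
    # assemble a DataFrame with one row per group.
    return pd.Series(res.x, index=['a0', 'a1'])


# Toy data with the same layout as in the question (fixed seed for repeatability)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'y0': rng.standard_normal(20),
    'y1': rng.standard_normal(20),
    'x0': rng.standard_normal(20),
    'x1': rng.standard_normal(20),
    'grpVar': ['a', 'b'] * 10})

start_val = np.full(2, 0.5)                             # starting values
bnds = [(0, 1), (0, 1)]                                 # 0 <= a_i <= 1
cons = {'type': 'eq', 'fun': lambda x: 1 - np.sum(x)}   # a_0 + a_1 = 1

# Pass the function itself (not the result of calling it); selecting the
# data columns explicitly keeps the grouping column out of each group.
coefs = (df.groupby('grpVar')[['y0', 'y1', 'x0', 'x1']]
           .apply(run_minimize, start_val=start_val, bnds=bnds, cons=cons))
print(coefs)
```

Keep in mind that apply() still calls the Python function once per group, so with tens of millions of groups the per-group overhead of minimize may dominate the run time.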

Python scipy.optimize: how to run multiple univariate constrained regressions by group
