Question

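Suppose I have a DataFrame with one column of y variable and many columns of x variables. I would like to be able to run multiple univariate regressions of y vs x1, y vs x2, ..., etc., and store the predictions back into the DataFrame. I also need to do this by a group variable.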

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    'y': np.random.randn(20),
    'x1': np.random.randn(20),
    'x2': np.random.randn(20),
    'grp': ['a', 'b'] * 10})

def ols_res(x, y):
    return sm.OLS(y, x).fit().predict()

df.groupby('grp').apply(ols_res)  # This does not work

The above code obviously does not work. It is not clear to me how to correctly pass the fixed y to the function while having apply iterate through the x columns (x1, x2, ...). I suspect there might be a very clever one-line solution to do this. Any ideas?

Answer 1

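The function you pass to apply must take a pandas.DataFrame as its first argument. Any additional keyword or positional arguments you give to apply are forwarded to that function. So your example works with a small modification. Change ols_res to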

def ols_res(df, xcols, ycol):
    return sm.OLS(df[ycol], df[xcols]).fit().predict()

Then you can use groupby and apply like this

df.groupby('grp').apply(ols_res, xcols=['x1', 'x2'], ycol='y')

or

df.groupby('grp').apply(ols_res, ['x1', 'x2'], 'y')
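To make the argument forwarding concrete without involving a regression, here is a minimal sketch. The helper col_span and the toy data are made up for illustration; the only point is that apply hands each group's sub-DataFrame to the function as the first argument and passes everything else through, whether given by keyword or by position.

```python
import pandas as pd

df = pd.DataFrame({
    'grp': ['a', 'b', 'a', 'b'],
    'y': [1.0, 2.0, 3.0, 5.0],
    'x1': [10.0, 20.0, 30.0, 50.0]})

def col_span(sub, cols, scale=1.0):
    # Hypothetical helper: sub is one group's sub-DataFrame;
    # cols and scale are the extra arguments forwarded by apply.
    return (sub[cols].max() - sub[cols].min()) * scale

# Selecting the data columns first keeps the grouping column out of sub.
grouped = df.groupby('grp')[['y', 'x1']]

by_keyword = grouped.apply(col_span, cols=['y', 'x1'], scale=2.0)
by_position = grouped.apply(col_span, ['y', 'x1'], 2.0)

print(by_keyword)
```

Both calls should give the same table, one row per group and one column per entry in cols.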

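Edit

The above code does not run multiple univariate regressions. Instead, it runs one multivariate regression per group. With (another) slight modification, however, it will.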

def ols_res(df, xcols, ycol):
    return pd.DataFrame({xcol: sm.OLS(df[ycol], df[xcol]).fit().predict() for xcol in xcols})
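One thing worth keeping in mind: sm.OLS does not add an intercept on its own, so unless a constant column is included (for example via sm.add_constant), each univariate fit here is a regression through the origin. As a rough sketch of what each per-column fit computes, the snippet below uses the closed-form slope in plain NumPy/pandas as a stand-in for sm.OLS(y, x).fit().predict(). The helper name origin_fit_predict and the tiny dataset are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def origin_fit_predict(x, y):
    # Least-squares slope for a line through the origin: sum(x*y) / sum(x*x).
    slope = (x * y).sum() / (x * x).sum()
    # In-sample predictions for this single regressor.
    return x * slope

df = pd.DataFrame({
    'y': [2.0, 4.0, 6.0],
    'x1': [1.0, 2.0, 3.0],
    'x2': [1.0, 0.0, -1.0]})

# DataFrame.apply calls the function once per x column, with y held fixed.
preds = df[['x1', 'x2']].apply(origin_fit_predict, y=df['y'])
print(preds)
```

Each column of preds comes from its own one-regressor fit, which is the difference from the single multivariate fit on ['x1', 'x2'] together.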

Edit 2

While the above solution works, I think the following is a little more pandas-y

import statsmodels.api as sm
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'y': np.random.randn(20),
    'x1': np.random.randn(20),
    'x2': np.random.randn(20),
    'grp': ['a', 'b'] * 10})

def ols_res(x, y):
    return pd.Series(sm.OLS(y, x).fit().predict())

df.groupby('grp').apply(lambda x : x[['x1', 'x2']].apply(ols_res, y=x['y']))

For some reason, if I define ols_res() as it was originally, the resulting DataFrame does not have the group label in its index.

Python pandas: how to run multiple univariate regressions by group
