Linear regression based on groupby

Date: 2016-02-21 21:23:33

Tags: python pandas scipy statsmodels

I have a df like this:

Allotment   Year    NDVI     A_Annex    Bachelor
A_Annex     1984    1.0      0.40       0.60
A_Annex     1984    1.5      0.56       0.89
A_Annex     1984    2.0      0.78       0.76
A_Annex     1985    3.4      0.89       0.54
A_Annex     1985    1.6      0.98       0.66
A_Annex     1986    2.5      1.10       0.44
A_Annex     1986    1.7      0.87       0.65
Bachelor    1984    8.9      0.40       0.60
Bachelor    1984    6.5      0.56       0.89
Bachelor    1984    4.2      0.78       0.76
Bachelor    1985    2.4      0.89       0.54
Bachelor    1985    1.7      0.98       0.66
Bachelor    1986    8.9      1.10       0.44
Bachelor    1986    9.6      0.87       0.65

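I want to run a regression for each group of a groupby. For every unique Allotment, I want to regress its NDVI values on the column associated with that Allotment. So for the A_Annex Allotment I want to regress NDVI on the A_Annex column, and then do the same for Bachelor. Essentially, I want to match each column to its Allotment and then regress the corresponding NDVI values on the values in that column.

I can do this for a single allotment like so: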

stat=merge.groupby(['Allotment']).apply(lambda x: sp.stats.linregress(x['A_Annex'], x['NDVI']))

But then I have to keep changing the column passed as x in sp.stats.linregress(x['A_Annex'], x['NDVI']) by hand, which I would like to avoid.
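To make the goal concrete, here is a minimal sketch of one way to avoid editing the column name by hand. It assumes that every value in Allotment exactly matches the name of a predictor column, as it does in the sample table; the names `df` and `results` are chosen for illustration only.

```python
import pandas as pd
from scipy import stats

# Sample data copied from the table above.
df = pd.DataFrame({
    "Allotment": ["A_Annex"] * 7 + ["Bachelor"] * 7,
    "Year": [1984, 1984, 1984, 1985, 1985, 1986, 1986] * 2,
    "NDVI": [1.0, 1.5, 2.0, 3.4, 1.6, 2.5, 1.7,
             8.9, 6.5, 4.2, 2.4, 1.7, 8.9, 9.6],
    "A_Annex": [0.40, 0.56, 0.78, 0.89, 0.98, 1.10, 0.87] * 2,
    "Bachelor": [0.60, 0.89, 0.76, 0.54, 0.66, 0.44, 0.65] * 2,
})

# Iterating over a groupby yields (group name, sub-frame) pairs, so the
# group name can be used to pick the matching predictor column.
# Assumption: each Allotment value is also a column name in df.
results = {
    name: stats.linregress(group[name], group["NDVI"])
    for name, group in df.groupby("Allotment")
}

for name, res in results.items():
    print(name, res.slope, res.intercept, res.rvalue ** 2, res.pvalue)
```

Each entry of `results` is the usual `linregress` result, so the slope, intercept, r-value, p-value and standard error are available per allotment.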

1 answer:

Answer 0: (score: 1)

Is this the sort of thing you are after?

r = {annex: pd.ols(x=group['A_Annex'], y=group['NDVI']) 
     for annex, group in df.groupby('Allotment')}
>>> r

{'A_Annex': 
 -------------------------Summary of Regression Analysis-------------------------

 Formula: Y ~ <x> + <intercept>

 Number of Observations:         7
 Number of Degrees of Freedom:   2

 R-squared:         0.3774
 Adj R-squared:     0.2529

 Rmse:              0.6785

 F-stat (1, 5):     3.0307, p-value:     0.1422

 Degrees of Freedom: model 1, resid 5

 -----------------------Summary of Estimated Coefficients------------------------
       Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
 --------------------------------------------------------------------------------
              x     1.9871     1.1415       1.74     0.1422    -0.2501     4.2244
      intercept     0.3731     0.9454       0.39     0.7094    -1.4798     2.2260
 ---------------------------------End of Summary---------------------------------,
 'Bachelor': 
 -------------------------Summary of Regression Analysis-------------------------

 Formula: Y ~ <x> + <intercept>

 Number of Observations:         7
 Number of Degrees of Freedom:   2

 R-squared:         0.0650
 Adj R-squared:    -0.1220

 Rmse:              3.4787

 F-stat (1, 5):     0.3478, p-value:     0.5810

 Degrees of Freedom: model 1, resid 5

 -----------------------Summary of Estimated Coefficients------------------------
       Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
 --------------------------------------------------------------------------------
              x    -3.4511     5.8522      -0.59     0.5810   -14.9213     8.0191
      intercept     8.7796     4.8467       1.81     0.1298    -0.7200    18.2792
 ---------------------------------End of Summary---------------------------------}

You can then extract the model parameters as follows:

>>> {k: r[k].sm_ols.params for k in r}
{'A_Annex': array([ 1.9871432 ,  0.37310585]),
 'Bachelor': array([-3.45111992,  8.77960702])}