Python: regression is slow compared to Stata (fixed-effects dummies)

Time: 2018-03-29 16:03:22

Tags: python regression stata dummy-variable

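I am trying to run a regression in Python, but it takes so long that I end up stopping it. In Stata it works and takes only a few seconds.

The cause is the categorical column that holds the group fixed effects. Without that variable, Stata and Python perform about the same, taking roughly 1 second for 200,000 observations: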

Stata code:

reg income height Number_children

Python code:

import statsmodels.formula.api as smf
model = smf.ols("income ~ height + Number_children", data=humans).fit()

To add the dummies, I changed the Stata code to areg:

areg income height Number_children, absorb(Village)

This takes only 1-2 seconds longer than without the dummies.

In Python:

model = smf.ols("income ~ height + Number_children + Village", data=humans).fit()

where:

Name: Village, dtype: category
Categories (3678, object):

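I stopped the regression after waiting 2 minutes. Any ideas on how to get the code to run, and to run almost as fast as in Stata? Is the problem caused by the variable or by the regression command?

  • Edit:

Based on Dimitriy's answer, I tried the transformation on all variables:

For example: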

humans["income_gr_m"]= humans["income"].groupby(humans['Village']).mean()
humans["income_star"] = humans["income"] - humans["income_gr_m"] + humans["income"].mean()

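However, this also keeps Python busy for at least 2 minutes (I stopped it again). Or should the transformation be done differently? Thanks

1 answer:

Answer 0: (score: 3)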

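areg does not actually invert the matrix containing 3,677 village indicators, which is what you are asking Python to do. Instead it transforms the data in a way that avoids the need for that, so it is much faster. This is also why the constant from regress with village dummies will not match the constant from areg, although the slope coefficients should be the same if you wait for Python to finish.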
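As a rough Python illustration of that transformation, the sketch below demeans each variable within its village, adds the overall mean back, and fits OLS on the result. The column names follow the question; the data are synthetic, the sizes and coefficients are made up, and numpy's least-squares routine stands in for smf.ols so that the example has fewer dependencies. Note the use of transform("mean"), which returns one value per row aligned with the original index, whereas a plain groupby(...).mean() returns one value per village.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's `humans` frame (all numbers are made up).
rng = np.random.default_rng(0)
n, n_villages = 2_000, 50
humans = pd.DataFrame({
    "Village": rng.integers(0, n_villages, size=n),
    "height": rng.normal(170, 10, size=n),
    "Number_children": rng.integers(0, 6, size=n).astype(float),
})
village_effect = rng.normal(0, 5, size=n_villages)
humans["income"] = (
    2.0 * humans["height"]
    - 3.0 * humans["Number_children"]
    + village_effect[humans["Village"].to_numpy()]
    + rng.normal(0, 1, size=n)
)

cols = ["income", "height", "Number_children"]

# Group means broadcast back to every row, then the grand mean added back in.
group_means = humans.groupby("Village")[cols].transform("mean")
star = humans[cols] - group_means + humans[cols].mean()

# OLS on the transformed data: a constant plus the two regressors.
X_star = np.column_stack([np.ones(n), star[["height", "Number_children"]].to_numpy()])
beta_within, *_ = np.linalg.lstsq(X_star, star["income"].to_numpy(), rcond=None)

# Reference fit with explicit village dummies (cheap here only because the data are small).
dummies = pd.get_dummies(humans["Village"], drop_first=True).to_numpy(dtype=float)
X_dummy = np.column_stack(
    [np.ones(n), humans[["height", "Number_children"]].to_numpy(), dummies]
)
beta_dummy, *_ = np.linalg.lstsq(X_dummy, humans["income"].to_numpy(), rcond=None)

# The slope coefficients from the two fits should agree.
slopes_match = np.allclose(beta_within[1:3], beta_dummy[1:3])
print(beta_within[1:3], slopes_match)
```

The transformed regression only ever works with a three-column design matrix, no matter how many villages there are, which is where the speed-up comes from.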

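Here is how areg calculates the coefficients, reproduced with regress. The standard errors will be too small because I did not do the degrees-of-freedom adjustment for the 5 absorbed effects, but I will do the adjustment manually in a loop by multiplying the SEs: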

. sysuse auto, clear
(1978 Automobile Data)

. drop if missing(rep78)
(5 observations deleted)

. /* (1) transform the data by subtracting the group specific mean and */
. /* adding the grand/overall mean back in for outcome and regressors */
. foreach var of varlist price weight length foreign {
  2.         bys rep78: egen group_mean = mean(`var')
  3.         qui sum `var'
  4.         gen double `var'_star = `var' - group_mean + r(mean)
  5.         drop group_mean
  6. }

. /* (2) Fit the model on transformed data */
. regress price_star weight_star length_star foreign_star

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(3, 65)        =     26.99
       Model |   315296838         3   105098946   Prob > F        =    0.0000
    Residual |   253139578        65  3894455.05   R-squared       =    0.5547
-------------+----------------------------------   Adj R-squared   =    0.5341
       Total |   568436416        68  8359359.06   Root MSE        =    1973.4

------------------------------------------------------------------------------
  price_star |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 weight_star |    6.15521   1.008605     6.10   0.000     4.140885    8.169534
 length_star |  -100.9268   33.82508    -2.98   0.004    -168.4801   -33.37341
foreign_star |   3394.052    782.454     4.34   0.000     1831.383     4956.72
       _cons |   5453.782   3829.487     1.42   0.159    -2194.232     13101.8
------------------------------------------------------------------------------

. /* (3) Adjust the SEs for DoF */
. foreach coef in weight_star length_star foreign_star _cons {
  2.         di "Adjusted SE for `coef': " %9.8gc _se[`coef']*sqrt(65/61)
  3. }
Adjusted SE for weight_star:  1.041149
Adjusted SE for length_star:  34.91649
Adjusted SE for foreign_star:  807.7009
Adjusted SE for _cons:   3953.05

. /* (4) Make sure areg gives the same output */
. areg price weight length foreign, absorb(rep78)

Linear regression, absorbing indicators         Number of obs     =         69
                                                F(   3,     61)   =      25.33
                                                Prob > F          =     0.0000
                                                R-squared         =     0.5611
                                                Adj R-squared     =     0.5108
                                                Root MSE          =  2037.1129

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |    6.15521   1.041149     5.91   0.000     4.073303    8.237116
      length |  -100.9268   34.91649    -2.89   0.005    -170.7466   -31.10692
     foreign |   3394.052   807.7009     4.20   0.000     1778.954    5009.149
       _cons |   5453.782    3953.05     1.38   0.173    -2450.831    13358.39
-------------+----------------------------------------------------------------
       rep78 |          F(4, 61) =      0.261   0.902           (5 categories)
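The degrees-of-freedom adjustment in step (3) can be sketched in Python as follows. The helper name dof_adjusted_se and its arguments are hypothetical, chosen for this illustration; the factor sqrt(65/61) comes from the output above, where 65 = 69 - 4 is the residual df of the regression on transformed data and 61 additionally subtracts the 4 indicators that are not collinear with the constant.

```python
import math


def dof_adjusted_se(se, n_obs, n_params, n_groups):
    """Rescale an SE taken from the regression on group-demeaned data.

    n_params counts the coefficients in that regression, including the constant.
    The absorbed indicators use up n_groups - 1 further degrees of freedom,
    because one group is collinear with the constant.
    """
    df_naive = n_obs - n_params
    df_absorbed = df_naive - (n_groups - 1)
    return se * math.sqrt(df_naive / df_absorbed)


# Numbers from the auto example: 69 observations, 4 coefficients, 5 rep78 categories.
adjusted = dof_adjusted_se(1.008605, n_obs=69, n_params=4, n_groups=5)
print(round(adjusted, 6))  # about 1.041149, the SE areg reports for weight
```

The same factor applies to every coefficient, so in practice it can be computed once and multiplied into the whole vector of standard errors.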

Stata code:

cls
sysuse auto, clear
drop if missing(rep78)
/* (1) transform the data by subtracting the group specific mean and */
/* adding the grand/overall mean back in for outcome and regressors */
foreach var of varlist price weight length foreign {
    bys rep78: egen group_mean = mean(`var')
    qui sum `var'
    gen double `var'_star = `var' - group_mean + r(mean)
    drop group_mean
}
/* (2) Fit the model on transformed data */
regress price_star weight_star length_star foreign_star
/* (3) Adjust the SEs for DoF */
foreach coef in weight_star length_star foreign_star _cons {
    di "Adjusted SE for `coef': " %9.8gc _se[`coef']*sqrt(65/61)
}
/* (4) Make sure areg gives the same output */
areg price weight length foreign, absorb(rep78)