Question

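I am running a simple regression of race times on temperature to develop some basic intuition. My dataset is very large, and each observation is the finish time of one unit in a particular race in a particular year.

For starters, I am running a very simple regression of race time on temperature bins.

Summary of the temperature variable: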

    Variable     |     Obs      Mean   Std. Dev.   Min    Max
    -------------+--------------------------------------------
    avg_temp_scc | 8309434      54.3      9.4        0     89

Summary of the time variable:

    Variable     |     Obs      Mean   Std. Dev.   Min    Max
    -------------+--------------------------------------------
    chiptime     | 8309434     267.5     59.6      122   1262

I decided to make 10-degree bins for temperature and regress time on them.

The code is:

    egen temp_trial = cut(avg_temp_scc), at(0,10,20,30,40,50,60,70,80,90)
    reg chiptime i.temp_trial

Output:

      Source |       SS         df       MS              Number of obs = 8309434
    ---------+---------------------------------           F(  8,8309425) =69509.83
       Model |  1.8525e+09        8   231557659           Prob > F      =  0.0000
    Residual |  2.7681e+10  8309425  3331.29368           R-squared     =  0.0627
    ---------+---------------------------------           Adj R-squared =  0.0627
       Total |  2.9534e+10  8309433  3554.22521           Root MSE      =  57.717

      chiptime |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -----------+----------------------------------------------------------------
    temp_trial |
            10 |  -26.63549   2.673903    -9.96   0.000    -31.87625   -21.39474
            20 |   10.23883   1.796236     5.70   0.000      6.71827    13.75939
            30 |   -16.1049   1.678432    -9.60   0.000    -19.39457   -12.81523
            40 |  -13.97918   1.675669    -8.34   0.000    -17.26343   -10.69493
            50 |  -10.18371   1.675546    -6.08   0.000    -13.46772   -6.899695
            60 |  -.6865365   1.675901    -0.41   0.682    -3.971243     2.59817
            70 |   44.42869   1.676883    26.49   0.000     41.14206    47.71532
            80 |   23.63064   1.766566    13.38   0.000     20.16824    27.09305
         _cons |   273.1366   1.675256   163.04   0.000     269.8531      276.42

So Stata correctly drops one of the temperature bins (in this case 0-10).

Now I created the bins manually and ran the regression again:

    gen temp0 = 1 if temp_trial==0
    replace temp0 = 0 if temp_trial!=0

    gen temp1 = 1 if temp_trial == 10
    replace temp1 = 0 if temp_trial != 10

    gen temp2 = 1 if temp_trial==20
    replace temp2 = 0 if temp_trial!=20

    gen temp3 = 1 if temp_trial==30
    replace temp3 = 0 if temp_trial!=30

    gen temp4=1 if temp_trial==40
    replace temp4=0 if temp_trial!=40

    gen temp5=1 if temp_trial==50
    replace temp5=0 if temp_trial!=50

    gen temp6=1 if temp_trial==60
    replace temp6=0 if temp_trial!=60

    gen temp7=1 if temp_trial==70
    replace temp7=0 if temp_trial!=70

    gen temp8=1 if temp_trial==80
    replace temp8=0 if temp_trial!=80

    reg chiptime temp0 temp1 temp2 temp3 temp4 temp5 temp6 temp7 temp8

The output is:

      Source |       SS         df       MS              Number of obs = 8309434
    ---------+---------------------------------           F(  9,8309424) =61786.51
       Model |  1.8525e+09        9   205829030           Prob > F      =  0.0000
    Residual |  2.7681e+10  8309424  3331.29408           R-squared     =  0.0627
    ---------+---------------------------------           Adj R-squared =  0.0627
       Total |  2.9534e+10  8309433  3554.22521           Root MSE      =  57.717

    --------------------------------------------------------------------------
    chiptime |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ---------+----------------------------------------------------------------
       temp0 |  -54.13245   6050.204    -0.01   0.993    -11912.32    11804.05
       temp1 |  -80.76794   6050.204    -0.01   0.989    -11938.95    11777.42
       temp2 |  -43.89362   6050.203    -0.01   0.994    -11902.08    11814.29
       temp3 |  -70.23735   6050.203    -0.01   0.991    -11928.42    11787.94
       temp4 |  -68.11162   6050.203    -0.01   0.991    -11926.29    11790.07
       temp5 |  -64.31615   6050.203    -0.01   0.992     -11922.5    11793.87
       temp6 |  -54.81898   6050.203    -0.01   0.993       -11913    11803.36
       temp7 |  -9.703755   6050.203    -0.00   0.999    -11867.89    11848.48
       temp8 |   -30.5018   6050.203    -0.01   0.996    -11888.68    11827.68
       _cons |    327.269   6050.203     0.05   0.957    -11530.91    12185.45

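Note that the bins are exhaustive over the whole dataset, that Stata includes a constant in the regression, and that none of the bins was dropped. Isn't that incorrect? Given that a constant is included in the regression, shouldn't one of the bins be dropped to serve as the "base case"? I feel as if I am missing something obvious here.

Edit: Here is a Dropbox link to the data and do-file: it contains only the two variables under consideration. The file is 129 MB. I also have a picture of my output at the link.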

Answer 1

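This may not be an "answer", but it is too long for a comment, so I am writing it here.

My results are different. In the last regression, one variable is dropped: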

. clear all

. set obs 8309434
number of observations (_N) was 0, now 8,309,434

. set seed 1

. gen avg_temp_scc = floor(90*uniform())

. egen temp_trial = cut(avg_temp_scc), at(0,10,20,30,40,50,60,70,80,90)

. gen chiptime = rnormal()

. reg chiptime i.temp_trial

      Source |       SS           df       MS      Number of obs   = 8,309,434
-------------+----------------------------------   F(8, 8309425)   =      0.88
       Model |  7.07729775         8  .884662219   Prob > F        =    0.5282
    Residual |   8308356.5 8,309,425  .999871411   R-squared       =    0.0000
-------------+----------------------------------   Adj R-squared   =   -0.0000
       Total |  8308363.58 8,309,433    .9998713   Root MSE        =    .99994

------------------------------------------------------------------------------
    chiptime |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  temp_trial |
         10  |   .0010732   .0014715     0.73   0.466    -.0018109    .0039573
         20  |   .0003255   .0014713     0.22   0.825    -.0025581    .0032092
         30  |   .0017061   .0014713     1.16   0.246    -.0011776    .0045897
         40  |   .0003128   .0014717     0.21   0.832    -.0025718    .0031973
         50  |   .0007142   .0014715     0.49   0.627    -.0021699    .0035983
         60  |   .0021693   .0014716     1.47   0.140    -.0007149    .0050535
         70  |  -.0008265   .0014715    -0.56   0.574    -.0037107    .0020577
         80  |  -.0005001   .0014714    -0.34   0.734    -.0033839    .0023837
             |
       _cons |  -.0006364   .0010403    -0.61   0.541    -.0026753    .0014025
------------------------------------------------------------------------------

. * "qui tab temp_trial, gen(temp)" is more convenient than "forv ..."
. forv k = 0/8 {
  2. gen temp`k' = temp_trial==`k'0
  3. }

. reg chiptime temp0-temp8
note: temp6 omitted because of collinearity

      Source |       SS           df       MS      Number of obs   = 8,309,434
-------------+----------------------------------   F(8, 8309425)   =      0.88
       Model |  7.07729775         8  .884662219   Prob > F        =    0.5282
    Residual |   8308356.5 8,309,425  .999871411   R-squared       =    0.0000
-------------+----------------------------------   Adj R-squared   =   -0.0000
       Total |  8308363.58 8,309,433    .9998713   Root MSE        =    .99994

------------------------------------------------------------------------------
    chiptime |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       temp0 |  -.0021693   .0014716    -1.47   0.140    -.0050535    .0007149
       temp1 |  -.0010961   .0014719    -0.74   0.456     -.003981    .0017888
       temp2 |  -.0018438   .0014717    -1.25   0.210    -.0047282    .0010407
       temp3 |  -.0004633   .0014717    -0.31   0.753    -.0033477    .0024211
       temp4 |  -.0018566   .0014721    -1.26   0.207    -.0047419    .0010287
       temp5 |  -.0014551   .0014719    -0.99   0.323      -.00434    .0014298
       temp6 |          0  (omitted)
       temp7 |  -.0029958   .0014719    -2.04   0.042    -.0058808   -.0001108
       temp8 |  -.0026694   .0014718    -1.81   0.070     -.005554    .0002152
       _cons |   .0015329   .0010408     1.47   0.141    -.0005071    .0035729
------------------------------------------------------------------------------

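The differences from yours are: (i) different data (I generate random numbers), and (ii) I use a forvalues loop instead of creating the variables by hand. However, I see no error in your code.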

Answer 2

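This isn't an answer either, but an extended comment, since I have grown tired of fighting the 600-character limit and the freeze on edits after 5 minutes.

In the comment thread on the original post, @user52932 wrote

Thank you for verifying this. Can you elaborate on what exactly this precision issue is? Does it only cause problems with this multicollinearity issue? Could this precision issue also make my estimates wrong when I use factor variables?

I want to state unequivocally that the results from the regression using factor variables are as correct as those of any properly specified regression.

In the regression using dummy variables, the model was misspecified by including a set of multicollinear variables. Stata then failed to detect the multicollinearity.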

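But there is no magic test for multicollinearity. It is inferred from characteristics of the cross-product matrix. In this case the cross-product matrix represents 8.3 million observations, and despite Stata's use of double precision throughout, the accumulated matrix passed Stata's test and was not detected as containing a multicollinear set of variables. This is the locus of the precision problem I referred to. Note that after reordering the observations, the accumulated cross-product matrix differed enough that it now failed Stata's test, and the misspecification was detected.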
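Independently of the check that `regress` applies to the cross-product matrix, the exact linear dependence can be confirmed directly from the data. The following is a minimal sketch, assuming the variables `temp0` through `temp8` were created as in the question and are adjacent in the dataset; `dummy_sum` is a hypothetical variable name introduced here for illustration.

```stata
* Sketch: confirm that the hand-made dummies are exactly collinear with the constant.
* Assumes temp0-temp8 exist as in the question; dummy_sum is an illustrative name.
egen byte dummy_sum = rowtotal(temp0-temp8)

* If the bins are exhaustive and mutually exclusive, every row sums to exactly 1,
* which is the same column of ones that the constant term contributes.
assert dummy_sum == 1
```

If the `assert` passes, the nine dummies add up to the constant in every observation, so the model with all nine dummies plus a constant is rank-deficient regardless of what the numerical test reports.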

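Now look at the results in the original post obtained from this misspecified regression. Note that if you add 54.13245 to the coefficient on each dummy variable and subtract the same amount from the constant, the resulting coefficients and constant are identical to those in the regression using factor variables. This is the textbook definition of the multicollinearity problem: it is not that the coefficient estimates are wrong, but that the coefficient estimates are not uniquely defined.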
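A short derivation may make the shift concrete. The symbols $c$ (constant), $b_k$ (coefficient on dummy $k$), $D_{ki}$ (dummy $k$ for observation $i$) and $\delta$ (an arbitrary shift) are notation introduced here for illustration; the numbers are taken from the two regression tables in the question.

```latex
% D_{ki} = 1 if observation i falls in bin k, else 0. The bins are exhaustive
% and mutually exclusive, so the nine dummies sum to one for every observation.
\begin{align*}
\sum_{k=0}^{8} D_{ki} &= 1 \quad \text{for every } i, \\
c + \sum_{k=0}^{8} b_k D_{ki}
  &= (c - \delta) + \sum_{k=0}^{8} (b_k + \delta)\, D_{ki}
  \quad \text{for any } \delta .
\end{align*}
% With \delta = 54.13245 and the estimates from the dummy regression:
\begin{align*}
c - \delta   &= 327.269 - 54.13245 = 273.13655 \approx 273.1366 \quad (\text{constant}), \\
b_0 + \delta &= -54.13245 + 54.13245 = 0 \quad (\text{base bin } 0\text{--}10), \\
b_1 + \delta &= -80.76794 + 54.13245 = -26.63549, \\
b_7 + \delta &= -9.703755 + 54.13245 = 44.428695 \approx 44.42869 .
\end{align*}
```

Every choice of $\delta$ gives the same fitted values, which is why the data cannot pin down a single set of coefficients once all nine dummies and the constant are included.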

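In a comment above, @user52932 wrote

I am not sure what Stata is using as the base case in my data.

The answer is that Stata is using no base case; the results are what is to be expected when a set of multicollinear variables is included among the independent variables.

So this question has reminded us that statistical packages like Stata cannot infallibly detect multicollinearity. As it turns out, and as I now realize, that is part of the genius of factor-variable notation. With factor-variable notation you tell Stata to create a set of dummy variables that by definition will be multicollinear, and because it understands the relationship between the dummy variables, it can eliminate the multicollinearity ex ante, before constructing the cross-product matrix, rather than trying to infer the problem ex post from characteristics of the cross-product matrix.

We should not be surprised that Stata occasionally fails to detect multicollinearity, but rather gratified that it does so as well as it does. After all, the second model is indeed a misspecification, which constitutes an unambiguous violation of the assumptions of OLS regression on the user's part.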
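As a rough sketch of what this means in practice, the commands below show specifications that avoid relying on collinearity detection after the fact. They assume the variables `chiptime`, `temp_trial`, and `temp0` through `temp8` from the question; choosing the 60-70 bin as an alternative base is an arbitrary illustration, not something from the original post.

```stata
* Factor-variable notation: Stata builds the indicators and omits the base level itself.
* By default the base is the lowest level (here the 0-10 bin).
reg chiptime i.temp_trial

* Same model with a different base: ib60. makes the 60-70 bin the omitted category.
reg chiptime ib60.temp_trial

* Hand-made dummies: leave one out yourself (here temp0 is the base) ...
reg chiptime temp1-temp8

* ... or keep all nine and drop the constant, which estimates one mean per bin.
reg chiptime temp0-temp8, noconstant
```

All four describe the same set of bin means; they differ only in which bin, if any, the other coefficients are measured against.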

Stata did not drop a variable due to multicollinearity (in a regression) when I think it should have
