Constrained linear model with two factor variables

Time: 2017-04-05 14:10:32

Tags: r constraints lm categorical-data

This question is related to https://stats.stackexchange.com/questions/3143/linear-model-with-constraints, but the situation is slightly different.

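I have a simple two-factor linear model with a continuous outcome Y. factor1 has about 350 categorical values, and factor2 has the same ~350 categories. I would like to constrain the coefficients so that, for each level, the coefficients on the two factors add to zero.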

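(The reason is that each level of factor1 and factor2 enters any given training example as either a positive or a negative input, but never appears twice in the same example.)

Here is a sample dataset for an example case in which each factor has four levels: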

            Y factor1 factor2
1  -1.2470416       A       B
2   4.3368592       C       D
3   1.0005147       D       A
4  -2.8309146       A       C
5   1.7501315       B       D
6  -0.8372193       B       A
7   3.3542627       C       A
8   4.3319422       D       C
9   1.4937895       D       B
10  2.0951559       A       D
11 -2.6610207       C       D
12 -4.9917367       D       B
13  2.2424169       D       A
14  1.0205409       C       A
15 -3.4584576       C       B

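The statistical model I want to estimate is: $$ y_{(i,j)} = \alpha_i - \beta_j + \varepsilon_{(i,j)} $$ where $y_{(i,j)}$ is the outcome for the pair $(i,j)$, factor1 indexes $i$, and factor2 indexes $j$. If group A appears in factor2, the parameter on A should be the negative of the one it gets when it appears in factor1. I therefore want to constrain $\alpha$ to equal $\beta$ at every level (that is, $\alpha_i = \beta_i$ for all $i$).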
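To see why this constraint turns the problem into an ordinary regression, substitute $\beta_j = \alpha_j$ into the model and write it with level indicators. This is a sketch of the algebra rather than something stated in the original post; the level index $k$ and the indicator notation $\mathbf{1}[\cdot]$ are introduced here for illustration.

```latex
\begin{aligned}
y_{(i,j)} &= \alpha_i - \alpha_j + \varepsilon_{(i,j)} \\
          &= \sum_{k} \alpha_k \left( \mathbf{1}[i = k] - \mathbf{1}[j = k] \right) + \varepsilon_{(i,j)}
\end{aligned}
```

Each regressor is therefore the difference between the factor1 and factor2 indicators for one level. Adding the same constant to every $\alpha_k$ leaves $\alpha_i - \alpha_j$ unchanged, so one level has to be fixed (for example $\alpha_A = 0$) before the remaining coefficients are identified.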

I can easily estimate a (nonsensical) unconstrained version of this model with lm(), as follows:

Y <- c( -1.2470416, 4.3368592 , 1.0005147 , -2.8309146 , 1.7501315 , -0.8372193 , 3.3542627 , 4.3319422 , 1.4937895 , 2.0951559 , -2.6610207 , -4.9917367 , 2.2424169 , 1.0205409 , -3.4584576 )
factor1 <- c( "A" , "C" , "D" , "A" , "B" , "B" , "C" , "D" , "D" , "A" , "C" , "D" , "D" , "C" , "C")
factor2 <- c( "B", "D", "A", "C", "D", "A", "A", "C", "B", "D", "D", "B", "A", "A", "B")
DF <- data.frame(Y,factor1,factor2)

summary(lm(Y~factor1+factor2,data=DF))

I get the following output:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.5363     2.5856   0.207    0.841
factor1B     -0.4579     3.1121  -0.147    0.887
factor1C      0.4047     2.4925   0.162    0.875
factor1D      1.8737     2.4098   0.778    0.459
factor2B     -3.6252     2.2050  -1.644    0.139
factor2C     -0.7226     2.8903  -0.250    0.809
factor2D      0.7561     2.2094   0.342    0.741

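Note that, in theory, factor1C should equal -factor2C, as my model dictates. That is not the case in the plain lm() output because I have not imposed any constraints.

So what I would like to do is estimate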

Y ~ factor1 + factor2  [subject to factor1+factor2=0 for each level of factor1, factor2]

In plain English, that would be something like

model2 <- lm(Y~factor1-factor2, data=DF)

But of course that is not how R interprets the expression (a minus sign in a model formula tells R to exclude that variable from the model).
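The effect of the minus sign can be checked directly by inspecting the terms R extracts from the formula. This is a small sketch using base R's terms(); the variable names `kept` and `same_as_factor1_only` are made up for the illustration.

```r
# Terms R keeps after parsing the formula: the minus sign removes factor2
kept <- attr(terms(Y ~ factor1 - factor2), "term.labels")
print(kept)

# The formula therefore describes the same model as Y ~ factor1 alone
same_as_factor1_only <- identical(kept, attr(terms(Y ~ factor1), "term.labels"))
print(same_as_factor1_only)
```

In other words, lm(Y~factor1-factor2, data=DF) fits a one-factor model and never subtracts anything.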

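I have read about contrasts, but I don't think there is a way to do this with them. I have also read about glmc, but I didn't see an easy way to use it with factors that have this many levels. It is also not clear to me that generating a new factor3 = factor1 - factor2 is a well-defined operation in this particular scenario. Finally, I tried running model3 <- lm(Y+factor2 ~ factor1, data=DF) but received an error.

My sense is that I need to build a constraint matrix by looping over the levels of each variable. I am new to R and am not sure how to do that. Any help would be appreciated.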

Note that this is very easy to do in Stata, as follows:

input ID  y factor1 factor2
1  -1.2470416       1       2
2   4.3368592       3       4
3   1.0005147       4       1
4  -2.8309146       1       3
5   1.7501315       2       4
6  -0.8372193       2       1
7   3.3542627       3       1
8   4.3319422       4       3
9   1.4937895       4       2
10  2.0951559       1       4
11 -2.6610207       3       4
12 -4.9917367       4       2
13  2.2424169       4       1
14  1.0205409       3       1
15 -3.4584576       3       2
end


constraint   1 2.factor1 = -2.factor2
constraint   2 3.factor1 = -3.factor2
constraint   3 4.factor1 = -4.factor2
cnsreg y i.factor1 i.factor2, constraints(1/3)

which gives the following output:

Constrained linear regression                   Number of obs     =         15
                                                F(   3,     11)   =       0.73
                                                Prob > F          =     0.5554
                                                Root MSE          =     2.9875

 ( 1)  2.factor1 + 2.factor2 = 0
 ( 2)  3.factor1 + 3.factor2 = 0
 ( 3)  4.factor1 + 4.factor2 = 0
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     factor1 |
          B  |   2.104393   1.439085     1.46   0.172    -1.063011    5.271798
          C  |   .5222649   1.377463     0.38   0.712    -2.509511     3.55404
          D  |   .6589209   1.266188     0.52   0.613    -2.127941    3.445783
             |
     factor2 |
          B  |  -2.104393   1.439085    -1.46   0.172    -5.271798    1.063011
          C  |  -.5222649   1.377463    -0.38   0.712     -3.55404    2.509511
          D  |  -.6589209   1.266188    -0.52   0.613    -3.445783    2.127941
             |
       _cons |   .5054862    .829675     0.61   0.555    -1.320616    2.331589
------------------------------------------------------------------------------

How can I do the above in R?

1 answer:

Answer 0: (score: 0)

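As noted in the most popular (but not accepted) answer at https://stats.stackexchange.com/questions/3143/linear-model-with-constraints, this problem is easily solved by creating a new variable that is the difference between the one-hot encodings of the two factors.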
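To make the encoding concrete, here is a minimal sketch in R that builds the differenced one-hot rows for three of the question's (factor1, factor2) pairs. The names `lev`, `onehot1`, `onehot2` and `diffmat` are made up for this illustration, and it assumes both factors share the same set of levels.

```r
# Three (factor1, factor2) pairs taken from the question's sample data
lev <- c("A", "B", "C", "D")
f1 <- factor(c("A", "C", "D"), levels = lev)
f2 <- factor(c("B", "D", "A"), levels = lev)

# One indicator column per level, then the row-wise difference
onehot1 <- outer(as.character(f1), lev, "==") * 1
onehot2 <- outer(as.character(f2), lev, "==") * 1
diffmat <- onehot1 - onehot2
colnames(diffmat) <- lev
print(diffmat)
```

Each row has +1 in the column of its factor1 level and -1 in the column of its factor2 level, so every row sums to zero. The columns are therefore linearly dependent, which is why one level is dropped as the baseline before regressing.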

In Stata, this can be done as follows:

* one-hot encode each of the factors
qui tab factor1, gen(f1dum)
qui tab factor2, gen(f2dum)

* generate difference in one-hot vectors
forv x=1/4{
    gen fdiffdum`x' = f1dum`x'-f2dum`x'
}

* regress y on differenced one-hot vectors
reg y fdiffdum2 fdiffdum3 fdiffdum4

which gives the following output:

      Source |       SS           df       MS      Number of obs   =        15
-------------+----------------------------------   F(3, 11)        =      0.73
       Model |  19.5429062         3  6.51430205   Prob > F        =    0.5554
    Residual |  98.1766922        11  8.92515383   R-squared       =    0.1660
-------------+----------------------------------   Adj R-squared   =   -0.0614
       Total |  117.719598        14  8.40854274   Root MSE        =    2.9875

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   fdiffdum2 |   2.104393   1.439085     1.46   0.172    -1.063011    5.271798
   fdiffdum3 |   .5222648   1.377463     0.38   0.712    -2.509511     3.55404
   fdiffdum4 |   .6589209   1.266188     0.52   0.613    -2.127941    3.445783
       _cons |   .5054862    .829675     0.61   0.555    -1.320616    2.331589
------------------------------------------------------------------------------

In R, one can do this:

factor1mat <- model.matrix(~factor1, DF)
factor2mat <- model.matrix(~factor2, DF)

factordiffmat <- factor1mat - factor2mat

summary(lm(Y~factordiffmat, data=DF))

Coefficients: (1 not defined because of singularities)
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)                0.5055     0.8297   0.609    0.555
factordiffmat(Intercept)       NA         NA      NA       NA
factordiffmatfactor1B      2.1044     1.4391   1.462    0.172
factordiffmatfactor1C      0.5223     1.3775   0.379    0.712
factordiffmatfactor1D      0.6589     1.2662   0.520    0.613
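The NA row appears because the two intercept columns cancel to a column of zeros when the model matrices are subtracted. A variant that avoids it, sketched here under the assumption that both factors draw on the same set of levels, builds full one-hot encodings and drops one baseline level explicitly. The names `lev`, `onehot1`, `onehot2`, `fdiff` and `alpha` are made up for this illustration.

```r
# Sample data from the question
Y <- c(-1.2470416, 4.3368592, 1.0005147, -2.8309146, 1.7501315,
       -0.8372193, 3.3542627, 4.3319422, 1.4937895, 2.0951559,
       -2.6610207, -4.9917367, 2.2424169, 1.0205409, -3.4584576)
factor1 <- c("A", "C", "D", "A", "B", "B", "C", "D", "D", "A", "C", "D", "D", "C", "C")
factor2 <- c("B", "D", "A", "C", "D", "A", "A", "C", "B", "D", "D", "B", "A", "A", "B")

# Share one level set so column k refers to the same level in both encodings
lev <- sort(union(factor1, factor2))
f1 <- factor(factor1, levels = lev)
f2 <- factor(factor2, levels = lev)

# Full one-hot encodings: "- 1" removes the intercept so every level gets a column
onehot1 <- model.matrix(~ f1 - 1)
onehot2 <- model.matrix(~ f2 - 1)

# Differenced design; drop the first level (A) as the baseline
fdiff <- onehot1 - onehot2
colnames(fdiff) <- lev
fdiff <- fdiff[, -1, drop = FALSE]

# Ordinary least squares on the differenced columns
fit <- lm(Y ~ fdiff)
alpha <- coef(fit)
print(round(alpha, 4))

# The implied factor2 coefficients are the negatives of the factor1 ones
print(round(-alpha[-1], 4))
```

On the sample data this should give the same estimates as the output above (an intercept of about 0.5055 and coefficients of about 2.1044, 0.5223 and 0.6589 for B, C and D). With roughly 350 levels the same code applies unchanged, as long as `lev` covers every level that occurs in either factor.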