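This question is related to https://stats.stackexchange.com/questions/3143/linear-model-with-constraints, but the situation is slightly different.
I have a simple two-factor linear model with a continuous outcome Y. factor1 has about 350 categorical levels, and factor2 has the same ~350 levels. I would like to constrain the coefficients so that, for each level, the factor1 and factor2 coefficients sum to zero.
(The reason is that each level of factor1 and factor2 enters any given training example as either a positive or a negative input, but never appears twice in the same example.)
Here is a sample dataset illustrating the situation, with four levels per factor: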
Y factor1 factor2
1 -1.2470416 A B
2 4.3368592 C D
3 1.0005147 D A
4 -2.8309146 A C
5 1.7501315 B D
6 -0.8372193 B A
7 3.3542627 C A
8 4.3319422 D C
9 1.4937895 D B
10 2.0951559 A D
11 -2.6610207 C D
12 -4.9917367 D B
13 2.2424169 D A
14 1.0205409 C A
15 -3.4584576 C B
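The statistical model I would like to estimate is:
$$ y_{(i,j)} = \alpha_i - \beta_j + \varepsilon_{(i,j)} $$
where $(i,j)$ indexes an outcome that depends on the pair: factor1 supplies the label $i$ and factor2 supplies the label $j$. If group A appears in factor2, its parameter should be the negative of the parameter it has when it appears in factor1. So I want to set $\alpha_k = \beta_k$ for every level $k$.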
I can easily estimate an unconstrained (and therefore nonsensical) version of this model with lm(), as follows:
Y <- c(-1.2470416, 4.3368592, 1.0005147, -2.8309146, 1.7501315, -0.8372193, 3.3542627, 4.3319422, 1.4937895, 2.0951559, -2.6610207, -4.9917367, 2.2424169, 1.0205409, -3.4584576)
factor1 <- c("A", "C", "D", "A", "B", "B", "C", "D", "D", "A", "C", "D", "D", "C", "C")
factor2 <- c("B", "D", "A", "C", "D", "A", "A", "C", "B", "D", "D", "B", "A", "A", "B")
DF <- data.frame(Y, factor1, factor2)
# summary() prints the coefficient table with standard errors
summary(lm(Y ~ factor1 + factor2, data = DF))
I get the following output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5363 2.5856 0.207 0.841
factor1B -0.4579 3.1121 -0.147 0.887
factor1C 0.4047 2.4925 0.162 0.875
factor1D 1.8737 2.4098 0.778 0.459
factor2B -3.6252 2.2050 -1.644 0.139
factor2C -0.7226 2.8903 -0.250 0.809
factor2D 0.7561 2.2094 0.342 0.741
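Note that, in theory, factor1C should equal -factor2C according to my model. That is not the case in the plain lm() output, because I have not imposed any constraints.
So what I would like to estimate is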
Y ~ factor1 + factor2 [subject to factor1+factor2=0 for each level of factor1, factor2]
In plain English, this would be something like
model2 <- lm(Y~factor1-factor2, data=DF)
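But of course that is not how R interprets the expression, because a minus sign in a model formula tells R to exclude that variable from the model.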
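A quick way to confirm how R reads the minus sign is to inspect the terms of the formula. This is a minimal sketch using only base R's terms(); the object names f_minus, f_plain, labels_minus and labels_plain are illustrative and not part of the original post.

```r
# In a model formula, `-` removes a term instead of subtracting it
f_minus <- Y ~ factor1 - factor2
f_plain <- Y ~ factor1

# terms() works on the symbolic formula alone, so no data is needed
labels_minus <- attr(terms(f_minus), "term.labels")
labels_plain <- attr(terms(f_plain), "term.labels")

print(labels_minus)                           # only factor1 remains
print(identical(labels_minus, labels_plain))  # both formulas describe the same model
```

So lm(Y~factor1-factor2, data=DF) would fit the same model as lm(Y~factor1, data=DF), with no factor2 effect at all.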
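I have read about contrasts, but I don't see a way to do this with them. I have also read about glmc, but I don't see a simple way to apply it to factors with this many levels. It is also not clear to me that generating a new factor3 = factor1-factor2 is a well-defined operation for this particular problem. Finally, I tried running model3 <- lm(Y+factor2 ~ factor1, data=DF), but got an error.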
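The failure is consistent with how R treats arithmetic on categorical data. The sketch below uses two small hypothetical vectors, f1 and f2, rather than the full data, and covers both storage modes a data frame column might have (factor or character).

```r
# Illustrative vectors; f1 and f2 are hypothetical names
f1 <- factor(c("A", "C", "D"))
f2 <- factor(c("B", "D", "A"))

# Arithmetic operators are not defined for factors: R warns and returns NA
diff_attempt <- suppressWarnings(f1 - f2)
print(diff_attempt)

# With character data the operation fails outright instead of returning NA
char_attempt <- tryCatch(1.5 + "B", error = function(e) "error")
print(char_attempt)
```

Either way, subtracting the raw labels carries no numeric meaning; what has a meaning is the difference of the indicator (dummy) columns built from those labels.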
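My sense is that I need to build a constraint matrix by looping over the levels of each variable. I'm new to R and not sure how to do that. Any help would be much appreciated.
Note that this is very easy to do in Stata, as follows: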
input ID y factor1 factor2
1 -1.2470416 1 2
2 4.3368592 3 4
3 1.0005147 4 1
4 -2.8309146 1 3
5 1.7501315 2 4
6 -0.8372193 2 1
7 3.3542627 3 1
8 4.3319422 4 3
9 1.4937895 4 2
10 2.0951559 1 4
11 -2.6610207 3 4
12 -4.9917367 4 2
13 2.2424169 4 1
14 1.0205409 3 1
15 -3.4584576 3 2
end
constraint 1 2.factor1 = -2.factor2
constraint 2 3.factor1 = -3.factor2
constraint 3 4.factor1 = -4.factor2
cnsreg y i.factor1 i.factor2, constraints(1/3)
which gives the following output:
Constrained linear regression Number of obs = 15
F( 3, 11) = 0.73
Prob > F = 0.5554
Root MSE = 2.9875
( 1) 2.factor1 + 2.factor2 = 0
( 2) 3.factor1 + 3.factor2 = 0
( 3) 4.factor1 + 4.factor2 = 0
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
factor1 |
B | 2.104393 1.439085 1.46 0.172 -1.063011 5.271798
C | .5222649 1.377463 0.38 0.712 -2.509511 3.55404
D | .6589209 1.266188 0.52 0.613 -2.127941 3.445783
|
factor2 |
B | -2.104393 1.439085 -1.46 0.172 -5.271798 1.063011
C | -.5222649 1.377463 -0.38 0.712 -3.55404 2.509511
D | -.6589209 1.266188 -0.52 0.613 -3.445783 2.127941
|
_cons | .5054862 .829675 0.61 0.555 -1.320616 2.331589
------------------------------------------------------------------------------
How can I do the above in R?
Answer 0 (score: 0)
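As the most upvoted (but not accepted) answer at https://stats.stackexchange.com/questions/3143/linear-model-with-constraints points out, this is easily solved by creating new variables that are the difference of the "one-hot" encoded factors.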
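To see why differencing works, substitute the constraint into the model from the question. The derivation below is a sketch: the intercept $\mu$, the indicator notation $\mathbf{1}[\cdot]$, and the summation index $k$ are notation introduced here for illustration, with $\mu$ playing the role of the _cons term in the output.

```latex
\begin{aligned}
y_{(i,j)} &= \mu + \alpha_i - \beta_j + \varepsilon_{(i,j)},
            \qquad \beta_k = \alpha_k \text{ for every level } k \\
          &= \mu + \alpha_i - \alpha_j + \varepsilon_{(i,j)} \\
          &= \mu + \sum_{k} \alpha_k \bigl( \mathbf{1}[i = k] - \mathbf{1}[j = k] \bigr) + \varepsilon_{(i,j)}
\end{aligned}
```

Each regressor is therefore the difference of the two one-hot indicators for level $k$. Since every observation has exactly one factor1 level and one factor2 level, these differences sum to zero across $k$, so one level has to be dropped as the reference (level A in the output, which amounts to $\alpha_A = 0$).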
In Stata, this can be done as follows:
* one-hot encode each of the factors
qui tab factor1, gen(f1dum)
qui tab factor2, gen(f2dum)
* generate difference in one-hot vectors
forv x=1/4{
gen fdiffdum`x' = f1dum`x'-f2dum`x'
}
* regress y on differenced one-hot vectors
reg y fdiffdum2 fdiffdum3 fdiffdum4
which gives the following output:
Source | SS df MS Number of obs = 15
-------------+---------------------------------- F(3, 11) = 0.73
Model | 19.5429062 3 6.51430205 Prob > F = 0.5554
Residual | 98.1766922 11 8.92515383 R-squared = 0.1660
-------------+---------------------------------- Adj R-squared = -0.0614
Total | 117.719598 14 8.40854274 Root MSE = 2.9875
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
fdiffdum2 | 2.104393 1.439085 1.46 0.172 -1.063011 5.271798
fdiffdum3 | .5222648 1.377463 0.38 0.712 -2.509511 3.55404
fdiffdum4 | .6589209 1.266188 0.52 0.613 -2.127941 3.445783
_cons | .5054862 .829675 0.61 0.555 -1.320616 2.331589
------------------------------------------------------------------------------
In R, one can do this:
factor1mat <- model.matrix(~factor1, DF)
factor2mat <- model.matrix(~factor2, DF)
factordiffmat <- factor1mat - factor2mat
summary(lm(Y~factordiffmat, data=DF))
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5055 0.8297 0.609 0.555
factordiffmat(Intercept) NA NA NA NA
factordiffmatfactor1B 2.1044 1.4391 1.462 0.172
factordiffmatfactor1C 0.5223 1.3775 0.379 0.712
factordiffmatfactor1D 0.6589 1.2662 0.520 0.613
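The NA row appears because the two (Intercept) columns cancel when the model matrices are subtracted (1 - 1 = 0). A slightly tidier variant, offered as a sketch and not as the original answer's code, drops that column before differencing and forces both factors onto one shared level set, which matters when some level occurs in only one of the two factors. The names lev and diffmat are assumptions introduced for this example.

```r
# Sample data from the question
Y <- c(-1.2470416, 4.3368592, 1.0005147, -2.8309146, 1.7501315,
       -0.8372193, 3.3542627, 4.3319422, 1.4937895, 2.0951559,
       -2.6610207, -4.9917367, 2.2424169, 1.0205409, -3.4584576)
factor1 <- c("A", "C", "D", "A", "B", "B", "C", "D", "D", "A", "C", "D", "D", "C", "C")
factor2 <- c("B", "D", "A", "C", "D", "A", "A", "C", "B", "D", "D", "B", "A", "A", "B")

# Give both factors the same, identically ordered level set
lev <- sort(union(factor1, factor2))
DF <- data.frame(Y,
                 factor1 = factor(factor1, levels = lev),
                 factor2 = factor(factor2, levels = lev))

# Dummy columns for every non-reference level, without the intercept column
X1 <- model.matrix(~ factor1, DF)[, -1, drop = FALSE]
X2 <- model.matrix(~ factor2, DF)[, -1, drop = FALSE]

# +1 where the level appears in factor1, -1 where it appears in factor2
diffmat <- X1 - X2
colnames(diffmat) <- lev[-1]

fit <- lm(Y ~ diffmat, data = DF)
print(coef(fit))  # intercept plus one coefficient per non-reference level
```

On the sample data this reproduces the estimates above (0.5055, 2.1044, 0.5223, 0.6589) without the undefined coefficient. Each factor2 effect is the negative of the reported coefficient for the same level.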