`NA`系数

Question

我试图理解R如何确定线性模型中交互的参考组。请考虑以下事项：

df <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 
5L, 5L, 5L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"), class = "factor"), 
    year = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
    1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
    2L, 1L, 2L, 1L, 2L), .Label = c("1", "2"), class = "factor"), 
    treatment = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0", "1"), class = "factor"), 
    y = c(1.4068142116718, 2.67041187927052, 2.69166439169131, 
    3.56550324537293, 1.60021286173782, 4.26629963353237, 3.85741108250572, 
    5.15740731957689, 4.15629768365669, 6.14875441068499, 3.31277276551286, 
    3.47223277168367, 3.74152201649338, 4.02734382610191, 4.49388620764795, 
    5.6432833241724, 4.76639399631094, 4.16885857079297, 4.96830394378801, 
    5.6286092105837, 6.60521404151111, 5.51371821706176, 3.97244221149279, 
    5.68793413111161, 4.90457233598412, 6.02826151378941, 4.92468415350312, 
    8.23718422822134, 5.87695836962708, 7.47264895892575)), .Names = c("id", 
"year", "treatment", "y"), row.names = c(NA, -30L), class = "data.frame")


lm(y ~ -1 + id + year + year:treatment, df)

#Coefficients:
#             id1               id2               id3               id4  
#          2.6585            3.9933            4.1161            5.3544  
#             id5             year2  year1:treatment1  year2:treatment1  
#          6.1991            0.7149           -0.6317                NA

R尝试估计完整的交互集，而不是一致地省略参考组。结果，我在结果中得到了NA。

此外，R与其丢弃的组不一致。我想在主效应和交互中估计一个具有相同省略组（year1）的模型。如何强制R从前面的模型中省略year1和year1:treatment1？

我知道这个问题有几种解决方法（例如，手动创建所有变量并将其写出模型的公式）。但我估计的实际模型是这个问题的更复杂版本，这样的解决方法会很麻烦。

Answer 1

R尝试估计完整的交互集，而不是一致地省略参考组。结果，我在结果中得到NA。

这两者之间没有因果关系。你得到NA纯粹是因为你的变量有嵌套。

R与其丢弃的组不一致。我想在主效应和交互中估计一个具有相同省略组（year1）的模型。如何强制R从前面的模型中省略year1和year1:treatment1？

没有不一致，但model.matrix有自己的规则。你看起来很不稳定＆＃34;对比，因为你没有主效treatment，只有互动treatment:year。

在下文中，我将首先解释NA系数，然后是交互的对比，最后给出一些建议。

`NA`系数

默认情况下，对比度处理用于对比因子，默认情况下为contr.treatement，第一个因子级别作为参考级别。看看你的数据：

levels(df$id)
# [1] "1" "2" "3" "4" "5"

levels(df$year)
# [1] "1" "2"

levels(df$treatment)
# [1] "0" "1"

现在看一个简单的线性模型：

lm(y ~ id + year + treatment, df)

#Coefficients:
#(Intercept)          id2          id3          id4          id5        year2  
#      2.153        1.651        1.773        2.696        3.541        1.094  
# treatment1  
#         NA

您可以看到id1，year1和treatment0不存在，因为它们被视为参考。

即使没有互动，您已经拥有NA系数。这意味着treatment嵌套在span{id, year}中。当您进一步包含treatment:year时，此类嵌套仍然存在。

事实上，进一步的测试显示treatment嵌套在id中：

lm(y ~ id + treatment, df)

#    Coefficients:
#(Intercept)          id2          id3          id4          id5   treatment1  
#      2.700        1.651        1.773        2.696        3.541           NA

总之，变量treatment对于您的建模目的而言完全是多余的。如果您添加id，则无需包含treatment或treatment:*，其中*可以是任何变量。

当回归模型中有许多因子变量时，很容易进行嵌套。注意，对比不一定会删除嵌套，因为对比仅识别变量名称，但不识别潜在的数字特征。请参考以下示例，了解如何欺骗contr.treatment：

df$ID <- df$id
lm(y ~ id + ID, df)

#Coefficients:
#(Intercept)          id2          id3          id4          id5          ID2  
#      2.700        1.651        1.773        2.696        3.541           NA  
#        ID3          ID4          ID5  
#         NA           NA           NA

看，对比是按预期工作的，但是ID嵌套在id中，所以我们有等级缺陷。

互动对比

我们首先通过删除NA变量来消除id强加的噪音。然后，具有treatment和year的回归模型将是完全排名，因此如果对比成功，则不应该看到NA。

交互或高阶效果的对比取决于是否存在低阶效应。比较以下模型：

lm(y ~ year:treatment, df)  ## no low-order effects at all

#Coefficients:
#     (Intercept)  treatment0:year1  treatment1:year1  treatment0:year2  
#          5.4523           -1.3976           -1.3466           -0.6826  
#treatment1:year2  
#              NA  

lm(y ~ year + treatment + year:treatment, df)  ## with all low-order effects

#Coefficients:
#     (Intercept)        treatment1             year2  treatment1:year2  
#         4.05471           0.05094           0.71493           0.63170

在第一个模型中，没有进行对比，因为没有主效果，只有二阶效应。这里NA是由于没有对比。

现在考虑另一组例子，包括一些但不是所有的主要效果：

lm(y ~ year + year:treatment, df)  ## main effect `treatment` is missing

#Coefficients:
#     (Intercept)             year2  year1:treatment1  year2:treatment1  
#         4.05471           0.71493           0.05094           0.68264  

lm(y ~ treatment + year:treatment, df)  ## main effect `year` is missing

#Coefficients:
#     (Intercept)        treatment1  treatment0:year2  treatment1:year2  
#         4.05471           0.05094           0.71493           1.34663

在这里，相互作用的对比将发生在缺少主效应的变量上。例如，在第一个模型中，缺少主效果treatment，因此交互下降treatement0;而在第二个模型中，缺少主效果year，因此互动会下降year1。

一般准则

指定高阶效果时，始终包括所有低阶效果。这不仅提供易于理解的对比行为，而且还具有其他一些吸引人的统计原因。您可以阅读Cross Validated上的Including the interaction but not the main effects in a model。

另一个建议是，始终包括拦截。在线性回归中，具有截距的模型产生无偏估计，残差将具有0均值。

为什么我得到NA系数以及`lm`如何降低交互的参考水平

1 个答案:

`NA`系数

互动对比

一般准则