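I am building a regression model on a fairly large dataset (2146 observations). These are repeated measures, so I will eventually use a mixed model, but I always like to start with simpler models to get a feel for the data. The problem is that my regression coefficients don't make sense to me, and I can't figure out why they change so drastically when I add terms to the model.
Here is the first simple regression model: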
fit1 <- lm(Outcome.Variable ~ Group, data = dat)
summary(fit1)
Call:
lm(formula = Outcome.Variable ~ Group, data = dat)
Residuals:
Min 1Q Median 3Q Max
-225.63 -75.96 -4.60 67.78 356.84
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 364.104 4.677 77.847 < 2e-16 ***
GroupB -65.187 7.268 -8.969 < 2e-16 ***
GroupC -31.776 6.982 -4.551 5.63e-06 ***
GroupD -37.268 6.337 -5.881 4.73e-09 ***
GroupE -11.172 7.661 -1.458 0.144902
GroupF -29.707 8.188 -3.628 0.000292 ***
GroupG -10.443 6.963 -1.500 0.133853
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 91.42 on 2139 degrees of freedom
Multiple R-squared: 0.0464, Adjusted R-squared: 0.04372
F-statistic: 17.35 on 6 and 2139 DF, p-value: < 2.2e-16
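These coefficients make sense to me: the intercept is the mean of GroupA, and each other group's estimate is its difference from GroupA. A quick check of the data confirms this interpretation: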
library(dplyr)
dat %>%
group_by(Group) %>%
summarize(Outcome.Variable.Mean = mean(Outcome.Variable))
# A tibble: 7 × 2
Group Outcome.Variable.Mean
<chr> <dbl>
1 A 364.1045
2 B 298.9173
3 C 332.3286
4 D 326.8360
5 E 352.9324
6 F 334.3972
7 G 353.6617
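To spell out the arithmetic behind that check, here is the reconstruction for three of the groups, using only the numbers printed above. The symbols $\hat{\beta}_0$, $\hat{\beta}_B$ and $\hat{\beta}_C$ are just my own labels for the intercept and the GroupB and GroupC estimates:

```latex
\begin{align*}
\hat{\mu}_A &= \hat{\beta}_0 = 364.104 \\
\hat{\mu}_B &= \hat{\beta}_0 + \hat{\beta}_B = 364.104 - 65.187 = 298.917 \\
\hat{\mu}_C &= \hat{\beta}_0 + \hat{\beta}_C = 364.104 - 31.776 = 332.328
\end{align*}
```

These agree with the group means 364.1045, 298.9173 and 332.3286 up to rounding in the printed coefficients.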
I can build another simple linear regression using my second variable, Day:
fit2 <- lm(Outcome.Variable ~ Day, data = dat)
summary(fit2)
Call:
lm(formula = Outcome.Variable ~ Day, data = dat)
Residuals:
Min 1Q Median 3Q Max
-228.56 -43.45 -4.70 44.41 321.77
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 388.003 2.598 149.367 <2e-16 ***
Day2 -5.278 3.668 -1.439 0.15
Day3 -136.108 3.589 -37.921 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 68.28 on 2143 degrees of freedom
Multiple R-squared: 0.4669, Adjusted R-squared: 0.4664
F-statistic: 938.6 on 2 and 2143 DF, p-value: < 2.2e-16
Again, a quick check of the data confirms that I am interpreting these regression coefficients correctly:
dat %>%
group_by(Day) %>%
summarize(Outcome.Variable.Mean = mean(Outcome.Variable))
# A tibble: 3 × 2
Day Outcome.Variable.Mean
<fctr> <dbl>
1 1 388.0027
2 2 382.7242
3 3 251.8942
The problem appears when I put both variables into the model:
fit3 <- lm(Outcome.Variable ~ Day + Group, data = dat)
summary(fit3)
Call:
lm(formula = Outcome.Variable ~ Day + Group, data = dat)
Residuals:
Min 1Q Median 3Q Max
-212.456 -43.442 -2.864 41.000 305.607
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 413.942 3.912 105.806 < 2e-16 ***
Day2 -5.801 3.504 -1.656 0.0979 .
Day3 -136.663 3.429 -39.859 < 2e-16 ***
GroupB -66.126 5.185 -12.753 < 2e-16 ***
GroupC -31.813 4.980 -6.388 2.06e-10 ***
GroupD -37.654 4.521 -8.329 < 2e-16 ***
GroupE -9.777 5.465 -1.789 0.0738 .
GroupF -24.570 5.842 -4.206 2.71e-05 ***
GroupG -10.067 4.967 -2.027 0.0428 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 65.21 on 2137 degrees of freedom
Multiple R-squared: 0.5152, Adjusted R-squared: 0.5134
F-statistic: 283.9 on 8 and 2137 DF, p-value: < 2.2e-16
These regression coefficients don't make sense to me. The intercept should be the mean of GroupA on Day 1, but a look at the data shows that it isn't:
as.data.frame(dat %>%
group_by(Day, Group) %>%
summarize(Outcome.Variable.Mean = mean(Outcome.Variable)))
Day Group Outcome.Variable.Mean
1 1 A 420.5681
2 1 B 331.6633
3 1 C 380.9213
4 1 D 382.2743
5 1 E 405.1115
6 1 F 392.5020
7 1 G 400.5005
8 2 A 405.3756
9 2 B 339.2346
10 2 C 389.3252
11 2 D 374.0798
12 2 E 388.7488
13 2 F 377.9685
14 2 G 395.5381
15 3 A 273.7767
16 3 B 229.6742
17 3 C 234.4119
18 3 D 230.6635
19 3 E 275.2313
20 3 F 254.7107
21 3 G 272.6063
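What is going on here? I don't want to move on to a mixed model before I understand what is happening in this more basic one. Why doesn't the intercept represent the GroupA mean on Day 1? Even the differences between the intercept and the other estimates are off. For example, the Day2 coefficient puts Day 2 at -5.8 relative to the intercept, but the difference between GroupA on Day 1 and GroupA on Day 2 is about 15.
Any help understanding what is happening here would be greatly appreciated.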
Answer 0 (score: 4)
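You are ignoring the interaction between the terms. Let me demonstrate using the mtcars data:
First, I run the regression disp ~ factor(cyl) (I have to call factor because all variables in mtcars are numeric by default):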
library(dplyr)
lm(disp ~ factor(cyl), mtcars)
#>
#> Call:
#> lm(formula = disp ~ factor(cyl), data = mtcars)
#>
#> Coefficients:
#> (Intercept) factor(cyl)6 factor(cyl)8
#> 105.14 78.18 247.96
mtcars %>% group_by(cyl) %>% summarize(mean = mean(disp))
#> # A tibble: 3 x 2
#> cyl mean
#> <dbl> <dbl>
#> 1 4 105.1364
#> 2 6 183.3143
#> 3 8 353.1000
As you can see, the regression sets the intercept to the mean disp of the cyl = 4 group.
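The same comparison can be done programmatically rather than by eye. This is only a minimal sketch on the built-in mtcars data; the names fit_cyl, rebuilt and direct are mine and purely illustrative. It rebuilds each group mean from the coefficients and compares the result with the means computed directly:

```r
# Fit the one-factor model on the built-in mtcars data.
fit_cyl <- lm(disp ~ factor(cyl), data = mtcars)
b <- coef(fit_cyl)

# The intercept alone is the reference level (cyl = 4); adding each
# remaining coefficient gives the mean of the corresponding level.
rebuilt <- c(b[1], b[1] + b[-1])
names(rebuilt) <- levels(factor(mtcars$cyl))

# Group means computed directly, for comparison.
direct <- tapply(mtcars$disp, mtcars$cyl, mean)

print(rebuilt)
print(all.equal(unname(rebuilt), as.vector(direct)))
```

With a single factor in the model, the two sets of numbers should agree up to floating-point tolerance.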
Next, I run the regression disp ~ factor(gear):
lm(disp ~ factor(gear), mtcars)
#>
#> Call:
#> lm(formula = disp ~ factor(gear), data = mtcars)
#>
#> Coefficients:
#> (Intercept) factor(gear)4 factor(gear)5
#> 326.3 -203.3 -123.8
mtcars %>% group_by(gear) %>% summarize(mean = mean(disp))
#> # A tibble: 3 x 2
#> gear mean
#> <dbl> <dbl>
#> 1 3 326.3000
#> 2 4 123.0167
#> 3 5 202.4800
Once again, the regression output reproduces the group means.
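For contrast, it may help to look at what a purely additive formula does, since that is the structure of fit3 in the question. The sketch below is my own illustration on mtcars (fit_add, cell_mean and max_gap are hypothetical names). It fits disp ~ factor(cyl) + factor(gear) and measures how far the fitted values are from the observed mean of each (cyl, gear) cell:

```r
# Additive model: main effects only, no interaction term.
fit_add <- lm(disp ~ factor(cyl) + factor(gear), data = mtcars)

# Observed mean of each (cyl, gear) cell, repeated for every row in that cell.
cell_mean <- ave(mtcars$disp, mtcars$cyl, mtcars$gear)

# Largest gap between the additive fit and the observed cell means.
max_gap <- max(abs(fitted(fit_add) - cell_mean))

print(length(coef(fit_add)))  # number of estimated coefficients
print(max_gap)
```

The additive model estimates 5 coefficients, while the data contain 8 observed cells, so it cannot in general match every cell mean. Its intercept is the model's fitted value for the reference cell, not that cell's observed mean.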
Now, to combine them, I use the regression formula disp ~ factor(cyl) * factor(gear), which is equivalent to disp ~ factor(cyl) + factor(gear) + factor(cyl):factor(gear):
lm(disp ~ factor(cyl)*factor(gear), mtcars)
#>
#> Call:
#> lm(formula = disp ~ factor(cyl) * factor(gear), data = mtcars)
#>
#> Coefficients:
#>                (Intercept)                factor(cyl)6
#>                     120.10                      121.40
#>               factor(cyl)8               factor(gear)4
#>                     237.52                      -17.47
#>              factor(gear)5  factor(cyl)6:factor(gear)4
#>                     -12.40                      -60.23
#> factor(cyl)8:factor(gear)4  factor(cyl)6:factor(gear)5
#>                         NA                      -84.10
#> factor(cyl)8:factor(gear)5
#>                     -19.22
mtcars %>% group_by(cyl, gear) %>% summarize(mean(disp))
#> # A tibble: 8 x 3
#> # Groups: cyl [?]
#> cyl gear `mean(disp)`
#> <dbl> <dbl> <dbl>
#> 1 4 3 120.1000
#> 2 4 4 102.6250
#> 3 4 5 107.7000
#> 4 6 3 241.5000
#> 5 6 4 163.8000
#> 6 6 5 145.0000
#> 7 8 3 357.6167
#> 8 8 5 326.0000
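The point can also be confirmed in code. The sketch below again uses the built-in mtcars data, and the names fit_int and cell_mean are mine. With the interaction included, the fitted values should coincide with the cell means in the table above, and the intercept should equal the mean of the reference cell (cyl = 4, gear = 3). The last commented line is a hypothetical equivalent for the question's data, which I cannot run because dat is not available here:

```r
# Model with main effects and their interaction.
fit_int <- lm(disp ~ factor(cyl) * factor(gear), data = mtcars)

# Observed mean of each (cyl, gear) cell, repeated for every row in that cell.
cell_mean <- ave(mtcars$disp, mtcars$cyl, mtcars$gear)

# Fitted values of the interaction model versus the observed cell means.
print(all.equal(unname(fitted(fit_int)), cell_mean))

# The intercept is the mean of the reference cell (cyl = 4, gear = 3).
print(unname(coef(fit_int)[1]))

# Hypothetical equivalent for the question's data (assumes dat as described there):
# fit4 <- lm(Outcome.Variable ~ Day * Group, data = dat)
```

The NA for factor(cyl)8:factor(gear)4 appears because mtcars contains no 8-cylinder cars with 4 gears, so that cell cannot be estimated. If every Day by Group combination is observed in the question's data (the 21-row table suggests so), the intercept of the interaction model would be expected to equal the Day 1, GroupA mean of 420.5681.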