使用分类变量运行回归分析允许您使用得到的截距和系数基本上计算分组均值(并且它是ANOVA分析的常见替代)。例如,在这个例子中,根据婚姻状况预测年龄,截距显示未婚人士的平均年龄,而截距+结婚系数显示已婚人士的平均年龄:
library(tidyverse)
# Make a bunch of two-level factors
gss_clean <- gss_cat %>%
filter(race != "Other") %>%
mutate(relig = recode(relig, Protestant = "Protestant", .default = "Not Protestant"),
marital = recode(marital, Married = "Married", .default = "Not married"),
race = fct_drop(race))
# Super simple model
model_1 <- lm(age ~ marital, data = gss_clean)
# Average age for unmarried people
coefficients(model_1)[1]
#> 46.68678
# Average age for married people
coefficients(model_1)[1] + coefficients(model_1)[2]
#> 49.37324
# Verify
gss_clean %>%
group_by(marital) %>%
summarize(avg_age = mean(age, na.rm = TRUE))
#> # A tibble: 2 x 2
#> marital avg_age
#> <fct> <dbl>
#> 1 Not married 46.7
#> 2 Married 49.4
只要添加正确的系数,这也适用于具有交互的模型。在这种情况下,这种系数组合为数据中的不同组提供了手段:
# Model with interactions
model_2 <- lm(age ~ marital + race + marital * race, data = gss_clean)
# Black, not married
coefficients(model_2)[1]
#> 42.93019
# Black, married
coefficients(model_2)[1] + coefficients(model_2)[2]
#> 46.40878
# White, not married
coefficients(model_2)[1] + coefficients(model_2)[3]
#> 47.73707
# White, married
coefficients(model_2)[1] + coefficients(model_2)[3] + coefficients(model_2)[4]
#> 46.20407
# Verify
gss_clean %>%
group_by(marital, race) %>%
summarize(avg_age = mean(age, na.rm = TRUE))
#> # A tibble: 4 x 3
#> # Groups: marital [?]
#> marital race avg_age
#> <fct> <fct> <dbl>
#> 1 Not married Black 42.9
#> 2 Not married White 47.7
#> 3 Married Black 46.4
#> 4 Married White 49.7
虽然手动执行此操作是可以的,但我希望找到一种更通用的方法来正确添加这些系数,而无需记住所有系数索引。这对于具有3向交互的更复杂模型尤其有用,例如:
# Model with lots of interactions
model_3 <- lm(age ~ marital + race + relig +
marital * race + marital * relig + race * relig +
marital * race * relig,
data = gss_clean)
# Black, not married, not protestant
coefficients(model_3)[1]
#> 38.56587
# Everything else: so much manual adding :(
是否有一些内置的方法可以将适当的系数添加到R中的截距中,或者是否有一些简单的方法来执行此操作并不涉及coefficients(model)[n] + coefficients(model)[m] + ...
的行和行?
答案 0 :(得分:2)
这个怎么样
pred_means <- function(model, continuous = "mean") {
terms <- purrr::map(model$model, unique)[-1]
if(continuous == "mean") {
cont <- !purrr::map_lgl(terms, is.factor)
terms[cont] <- map(terms[cont], mean, na.rm = TRUE)
}
pred_data <- expand.grid(terms)
pred_data$mean <- predict(model, newdata = pred_data)
pred_data
}
默认情况下,此函数将计算所有级别的均值,同时取任何连续预测值的均值。但是,您也可以将continuous
的参数更改为"mean"
以外的任何值,它将为您提供连续预测变量的所有观测值的均值。这是一些例子
pred_means(model_1)
marital mean
1 Not married 46.68678
2 Married 49.37324
pred_means(model_2)
marital race mean
1 Not married White 47.73707
2 Married White 49.68266
3 Not married Black 42.93019
4 Married Black 46.40878
pred_means(model_3)
marital race relig mean
1 Not married White Protestant 51.72343
2 Married White Protestant 51.26028
3 Not married Black Protestant 44.77419
4 Married Black Protestant 47.38913
5 Not married White Not Protestant 44.39232
6 Married White Not Protestant 47.81385
7 Not married Black Not Protestant 38.56587
8 Married Black Not Protestant 42.80000
一个连续预测器的例子
model_4 <- lm(age ~ tvhours + marital + race + relig +
marital * race + marital * relig + race * relig +
marital * race * relig,
data = gss_clean)
pred_means(model_4)
tvhours marital race relig mean
1 11.21739 Not married White Protestant 60.97432
2 11.21739 Married White Protestant 61.58266
3 11.21739 Not married Black Protestant 52.50899
4 11.21739 Married Black Protestant 56.96309
5 11.21739 Not married White Not Protestant 54.22298
6 11.21739 Married White Not Protestant 57.80670
7 11.21739 Not married Black Not Protestant 46.84537
8 11.21739 Married Black Not Protestant 52.43112
head(pred_means(model_4, "all"))
tvhours marital race relig mean
1 12 Not married White Protestant 61.87577
2 2 Not married White Protestant 50.35732
3 4 Not married White Protestant 52.66101
4 1 Not married White Protestant 49.20548
5 3 Not married White Protestant 51.50917
6 0 Not married White Protestant 48.05364