All possible variable combinations for a generalized linear model

Asked: 2014-03-26 14:29:37

Tags: r

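Is there a way in R to run a GLM for every different combination of the variables in a data frame?

For example, if I have 4 explanatory variables, I could model Y as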

m1 = glm(Y ~ V1, data = d)
m2 = glm(Y ~ V1 + V2, data = d)
m3 = glm(Y ~ V1 + V2 + V3, data = d)
m4 = glm(Y ~ V1 + V2 + V3 + V4, data = d)

However, I could also fit

m5 = glm(Y ~ V1 + V2 + V4, data = d)

and so on.

Is there a way in R to go through all the different combinations of variables in a data frame, to see which ones make the best predictors?

2 Answers:

Answer 0: (score: 10)

This is called dredging, and the MuMIn package does it for you:

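library(MuMIn)
data(Cement)
options(na.action = "na.fail")  # current MuMIn versions refuse to dredge under the default na.omit
fm1 <- lm(y ~ ., data = Cement) # global model containing all candidate predictors
dd <- dredge(fm1)               # fits every subset of the global model's terms
dd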

Global model call: lm(formula = y ~ ., data = Cement)
---
Model selection table 
   (Intrc)    X1      X2      X3      X4 df  logLik  AICc delta weight
4    52.58 1.468  0.6623                  4 -28.156  69.3  0.00  0.566
12   71.65 1.452  0.4161         -0.2365  5 -26.933  72.4  3.13  0.119
8    48.19 1.696  0.6569  0.2500          5 -26.952  72.5  3.16  0.116
10  103.10 1.440                 -0.6140  4 -29.817  72.6  3.32  0.107
14  111.70 1.052         -0.4100 -0.6428  5 -27.310  73.2  3.88  0.081
15  203.60       -0.9234 -1.4480 -1.5570  5 -29.734  78.0  8.73  0.007
16   62.41 1.551  0.5102  0.1019 -0.1441  6 -26.918  79.8 10.52  0.003
13  131.30               -1.2000 -0.7246  4 -35.372  83.7 14.43  0.000
7    72.07        0.7313 -1.0080          4 -40.965  94.9 25.62  0.000
9   117.60                       -0.7382  3 -45.872 100.4 31.10  0.000
3    57.42        0.7891                  3 -46.035 100.7 31.42  0.000
11   94.16        0.3109         -0.4569  4 -45.761 104.5 35.21  0.000
2    81.48 1.869                          3 -48.206 105.1 35.77  0.000
6    72.35 2.312          0.4945          4 -48.005 109.0 39.70  0.000
5   110.20               -1.2560          3 -50.980 110.6 41.31  0.000
1    95.42                                2 -53.168 111.5 42.22  0.000
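
Since the question asks about glm rather than lm, here is a minimal sketch of the same approach with a GLM as the global model, followed by pulling out the top-ranked models. It assumes MuMIn is installed and reuses the Cement data; the object names gm, top and best and the cutoff delta < 2 are illustrative choices, not part of the original answer.

```r
library(MuMIn)
data(Cement)

# dredge() requires that missing values cause an error rather than being dropped
options(na.action = "na.fail")

# Global model as a GLM (gaussian family by default, so comparable to the lm above)
gm <- glm(y ~ ., data = Cement)

# Fit every subset of the global model's terms, ranked by AICc
dd <- dredge(gm)

# Keep only the models within 2 AICc units of the best one
top <- subset(dd, delta < 2)

# Extract the single best-ranked model as an ordinary fitted glm object
best <- get.models(dd, subset = 1)[[1]]
summary(best)
```

For a non-gaussian response, the family argument of glm would be set in the global model and dredge would carry it through to each submodel.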

Answer 1: (score: 4)

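If you only want to use base R, without a package that does the dredging for you, you can use the combn function to build a list of all possible GLM objects: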

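# Example data: a response Y and four candidate predictors V1..V4
d <- data.frame(replicate(5, rnorm(10)))
names(d) <- c('Y', paste0('V', 1:4))
dep_var <- 'Y'
indep_vars <- setdiff(names(d), dep_var)

# For each subset size, fit one GLM per combination of predictors
glms <- Reduce(append, lapply(seq_along(indep_vars),
  function(num_vars) {
    Reduce(append, apply(combn(length(indep_vars), num_vars), 2, function(vars) {
      # vars holds the indices of the predictors in this combination
      formula_string <- paste(c(dep_var, paste(indep_vars[vars], collapse = "+")), collapse = '~')
      # Return a one-element list named after its formula
      structure(list(glm(as.formula(formula_string), data = d)), .Names = formula_string)
    }))
  }
))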

print(names(glms))
# [1] "Y~V1"          "Y~V2"          "Y~V3"          "Y~V4"          "Y~V1+V2"       "Y~V1+V3"       "Y~V1+V4"       "Y~V2+V3"       "Y~V2+V4"       "Y~V3+V4"       "Y~V1+V2+V3"    "Y~V1+V2+V4"
# [13] "Y~V1+V3+V4"    "Y~V2+V3+V4"    "Y~V1+V2+V3+V4"

print(glms[["Y~V2+V3+V4"]])

# Call:  glm(formula = as.formula(formula_string), data = d)
#
# Coefficients:
# (Intercept)           V2           V3           V4
#     0.12721      0.04748      0.11369     -0.04258

# Degrees of Freedom: 9 Total (i.e. Null);  6 Residual
# Null Deviance:      8.932
# Residual Deviance: 8.695  AIC: 36.98
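
To get from a list of fitted models to "which variables predict best", the models can be ranked by an information criterion. The following is a minimal base-R sketch, assuming AIC is an acceptable criterion; it rebuilds the same 15 models in a slightly simpler way, and set.seed, formula_strings and aic_table are names introduced here for illustration only.

```r
# Simulated data with the same layout as above (seed added only for reproducibility)
set.seed(1)
d <- data.frame(replicate(5, rnorm(10)))
names(d) <- c("Y", paste0("V", 1:4))
indep_vars <- setdiff(names(d), "Y")

# Build one formula string per non-empty subset of the predictors
formula_strings <- unlist(lapply(seq_along(indep_vars), function(k) {
  combn(indep_vars, k, FUN = function(vars) {
    paste("Y", paste(vars, collapse = "+"), sep = "~")
  })
}))

# Fit one GLM per formula (gaussian family by default)
glms <- lapply(formula_strings, function(f) glm(as.formula(f), data = d))
names(glms) <- formula_strings

# Collect the AIC of every model and sort ascending; lower AIC is better
aic_table <- data.frame(formula = formula_strings,
                        AIC = sapply(glms, AIC),
                        row.names = NULL)
aic_table <- aic_table[order(aic_table$AIC), ]
head(aic_table)
```

The number of models is 2^p - 1 for p predictors, so this exhaustive approach is only practical for a small number of candidate variables.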