循环或应用多个回归,将系数和p值提取到数据框中

时间:2018-01-25 21:16:16

标签: r loops regression apply linear-regression

我有一个包含3个相关(LHS)变量和4个独立(RHS)变量的数据框。我想在每个RHS变量上运行每个LHS变量的线性回归,并将每个回归的结果作为一行存储在数据框中,列为:lhs,rhs,Estimate,Std。错误,t值,Pr(> | t |)。

例如,使用mtcars,我考虑了一个嵌套循环:

lhs <- c('mpg', 'cyl', 'disp')
rhs <- c('hp', 'drat', 'wt', 'qsec')

reg_count <- 1
for (i in lhs){
  for (j in rhs){
    model <- lm(i ~ j, data = mtcars)
    results[reg_count] <- coef(summary(model))
    reg_count <- reg_count + 1
  }
}

然而,由于多种原因,这失败了。有一种简单的方法可以做到这一点吗?理想情况下使用apply()函数而不是循环?

2 个答案:

答案 0 :(得分:2)

这是我将如何做到的。我稍微缩短了你的例子,但这没关系:

lhs <- c('mpg', 'cyl', 'disp')
rhs <- c('hp', 'drat')

models = list()
for (i in lhs){
  for (j in rhs){
    models[[paste(i, "vs", j)]] <- lm(as.formula(paste(i, "~", j)), data = mtcars)
  }
}

如果您想使用apply,则需要以matrix开头。运行时间的差异可以忽略不计。

# with apply:
coefs_mat = expand.grid(lhs, rhs)
mods = apply(coefs_mat, 1, function(row) {
  lm(as.formula(paste(row[1], "~", row[2])), data = mtcars)
})
names(mods) = with(coefs_mat, paste(Var1, "vs", Var2))

两种方法都给出了相同的结果。现在我们可以使用broom::tidy

来提取系数等
# get coefs
library(broom)
coefs = lapply(mods, tidy, simplify = F)
# combine
dplyr::bind_rows(coefs, .id = "mod")
#             mod        term      estimate    std.error  statistic      p.value
# 1     mpg vs hp (Intercept)   30.09886054 1.633921e+00 18.4212465 6.642736e-18
# 2     mpg vs hp          hp   -0.06822828 1.011930e-02 -6.7423885 1.787835e-07
# 3     cyl vs hp (Intercept)    3.00679525 4.254852e-01  7.0667442 7.405351e-08
# 4     cyl vs hp          hp    0.02168354 2.635142e-03  8.2286042 3.477861e-09
# 5    disp vs hp (Intercept)   20.99248341 3.260662e+01  0.6438104 5.245902e-01
# 6    disp vs hp          hp    1.42977003 2.019414e-01  7.0801224 7.142679e-08
# 7   mpg vs drat (Intercept)   -7.52461844 5.476663e+00 -1.3739423 1.796391e-01
# 8   mpg vs drat        drat    7.67823260 1.506705e+00  5.0960421 1.776240e-05

我们还可以提取模型摘要统计数据:

# get summary stats
summ = lapply(mods, glance, simplify = F)
dplyr::bind_rows(summ, .id = "mod")
#            mod r.squared adj.r.squared     sigma statistic      p.value df     logLik
# 1    mpg vs hp 0.6024373     0.5891853  3.862962  45.45980 1.787835e-07  2  -87.61931
# 2    cyl vs hp 0.6929688     0.6827344  1.005944  67.70993 3.477861e-09  2  -44.56307
# 3   disp vs hp 0.6255997     0.6131197 77.089503  50.12813 7.142679e-08  2 -183.41236
# 4  mpg vs drat 0.4639952     0.4461283  4.485409  25.96964 1.776240e-05  2  -92.39996
# 5  cyl vs drat 0.4899134     0.4729105  1.296596  28.81354 8.244636e-06  2  -52.68517
# 6 disp vs drat 0.5044038     0.4878839 88.693360  30.53315 5.282022e-06  2 -187.89934
#         AIC       BIC     deviance df.residual
# 1 181.23863 185.63584    447.67431          30
# 2  95.12614  99.52335     30.35771          30
# 3 372.82473 377.22194 178283.74604          30
# 4 190.79993 195.19714    603.56673          30
# 5 111.37033 115.76754     50.43482          30
# 6 381.79868 386.19588 235995.36410          30

答案 1 :(得分:2)

您可以从expand.grid开始,为相关/自变量对提供一个很好的数据帧。然后将公式和模型添加到数据中。

pairings <- expand.grid(
  lhs = c('mpg', 'cyl', 'disp'),
  rhs = c('hp', 'drat', 'wt', 'qsec')
)

pairings[["formula"]] <- lapply(
  X   = paste(pairings[["lhs"]], "~", pairings[["rhs"]]),
  FUN = as.formula
)

pairings[["model"]] <- lapply(
  X    = pairings[["formula"]],
  FUN  = lm,
  data = mtcars
)

结果:

str(pairings, max.level = 1)
# 'data.frame': 12 obs. of  4 variables:
#  $ lhs    : Factor w/ 3 levels "mpg","cyl","disp": 1 2 3 1 2 3 1 2 3 1 ...
#  $ rhs    : Factor w/ 4 levels "hp","drat","wt",..: 1 1 1 2 2 2 3 3 3 4 ...
#  $ formula:List of 12
#  $ model  :List of 12
#  - attr(*, "out.attrs")=List of 2

pairings[["model"]][[1]]
# Call:
# FUN(formula = X[[i]], data = ..1)
# 
# Coefficients:
# (Intercept)           hp
#    30.09886     -0.06823