How do I run linear regressions on a dataset, using each variable in turn as the dependent variable?

Asked: 2016-06-17 06:53:08

Tags: r analytics rstudio regression linear-regression

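I have a dataset called 'dt' in which every variable is numeric. I want to take each variable in turn as the dependent variable and use stepwise regression to find the best combination of the remaining variables as predictors. If the resulting "best combination" gives an adjusted R^2 > 0.70, it should be printed to the console. Here is my naive attempt.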

for(i in ncol(dt)){
    nul<-lm(dt[,i]~1,data=dt)
    ful<-lm(dt[,i]~.,data=dt)
    model<-step(nul,scope = list(lower=nul,upper=ful),direction="forward",trace=FALSE)
    if((summary(lm(as.formula(model$call),data=dt)))$adj.r.squared>0.70){
        print(as.formula(model$call))
        cat(paste("\n"))
    }
}

This is the undesirable output I get:

dt[, i] ~ Y

Warning messages:
1: attempting model selection on an essentially perfect fit is nonsense 
2: In summary.lm(lm(as.formula(model$call), data = dt)) :
essentially perfect fit: summary may be unreliable

1 Answer:

Answer 0 (score: 1)

As @42- correctly pointed out, what you will get is statistical "garbage".

But if you insist on "testing" it anyway, it is easy to get the R^2 of multiple linear models with leaps::regsubsets.

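library(leaps)
# One-predictor models of Fertility (column 1) on each of the remaining columns
a <- regsubsets(x = as.matrix(swiss[, -1]), y = swiss[, 1], nvmax = 1, nbest = 100,
                intercept = FALSE, method = "exhaustive", really.big = TRUE)
summary(a)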

Subset selection object
5 Variables 
                 Forced in Forced out
Examination          FALSE      FALSE
Education            FALSE      FALSE
Catholic             FALSE      FALSE
Infant.Mortality     FALSE      FALSE
100 subsets of each size up to 1
Selection Algorithm: exhaustive
         Agriculture Examination Education Catholic Infant.Mortality
1  ( 1 ) " "         " "         " "       " "      "*"             
1  ( 2 ) "*"         " "         " "       " "      " "             
1  ( 3 ) " "         "*"         " "       " "      " "             
1  ( 4 ) " "         " "         " "       "*"      " "             
1  ( 5 ) " "         " "         "*"       " "      " "     

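In the example above, five lm models are fitted with 'Fertility' as the dependent variable and each of the remaining variables as the single predictor of its own model, e.g. Fertility ~ Infant.Mortality, Fertility ~ Agriculture, etc.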

summary(a)$rsq # returns R^2 for each of the five models

[1] 0.9703145 0.8558076 0.7054873 0.5660736 0.4474043
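One thing worth checking before reading much into these numbers: the call above passes intercept=F, and for a model without an intercept R^2 is measured against zero rather than against the mean of the response, which tends to inflate it. A minimal base-R sketch of that effect, using lm() instead of regsubsets() (my assumption is that the two agree for a no-intercept fit; I have not checked leaps' internals):

```r
# Same single-predictor model, fitted with and without an intercept.
with_intercept    <- lm(Fertility ~ Infant.Mortality, data = swiss)
without_intercept <- lm(Fertility ~ 0 + Infant.Mortality, data = swiss)

r2_with    <- summary(with_intercept)$r.squared
r2_without <- summary(without_intercept)$r.squared

# Uncentered R^2 computed by hand: 1 - RSS / sum(y^2)
r2_uncentered <- 1 - sum(resid(without_intercept)^2) / sum(swiss$Fertility^2)

c(with_intercept = r2_with, without_intercept = r2_without)
```

If the no-intercept value is the large one, a 0.70 cut-off applied to it says very little about how well the predictor explains the variation in Fertility.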

Turning the above into a function, say:

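# x is the column index of the dependent variable; all other columns are candidate predictors
nonsense_lm <- function(data, x) {
    regsubsets(x = as.matrix(data[, -x]), y = data[, x], nvmax = 1, nbest = 100,
               intercept = FALSE, method = "exhaustive", really.big = TRUE)
}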

Then loop over the columns, using each variable in turn as the dependent variable:

nonsense <- lapply(1:ncol(swiss), function(x) nonsense_lm(swiss, x))
lapply(nonsense, function(x)summary(x)$rsq)

 [[1]]
 [1] 0.9703145 0.8558076 0.7054873 0.5660736 0.4474043

 [[2]]
 [1] 0.8558076 0.8121654 0.5785572 0.4961365 0.2715248

 [[3]]
 [1] 0.7844437 0.7729180 0.7054873 0.4961365 0.2132834

 [[4]]
 [1] 0.7729180 0.5456765 0.4474043 0.2715248 0.2137402

 [[5]]
 [1] 0.5785572 0.5660736 0.5135628 0.2137402 0.2132834

 [[6]]
 [1] 0.9703145 0.8121654 0.7844437 0.5456765 0.5135628
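The bare vectors above do not say which predictor produced which R^2. A possible way to label them, as a sketch: single_predictor_r2 is a hypothetical helper name, and it relies on the outmat component of summary(), which is the "*" matrix printed earlier.

```r
library(leaps)

# Hypothetical helper: same regsubsets() call as above, but the returned
# R^2 values are named by the single predictor used in each model.
single_predictor_r2 <- function(data, response_idx) {
  fit <- regsubsets(x = as.matrix(data[, -response_idx]), y = data[, response_idx],
                    nvmax = 1, nbest = 100, intercept = FALSE,
                    method = "exhaustive", really.big = TRUE)
  s <- summary(fit)
  # Each row of outmat marks the one predictor in that model with "*".
  predictor <- apply(s$outmat == "*", 1, function(row) colnames(s$outmat)[row])
  setNames(s$rsq, predictor)
}

r2_by_response <- lapply(seq_along(swiss), function(i) single_predictor_r2(swiss, i))
names(r2_by_response) <- names(swiss)

# Keep only the models above the 0.70 cut-off from the question.
lapply(r2_by_response, function(r2) r2[r2 > 0.70])
```

The result is a list keyed by dependent variable, each element holding the predictors whose one-variable model clears the threshold.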

Again, note that the R^2 obtained this way really is statistical "garbage". Having a proper question to test is the most crucial step of any analysis.
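As for the loop in the question, two things appear to explain the output it produced. for(i in ncol(dt)) runs only once, for the last column, because ncol(dt) is a single number. And dt[,i] ~ . with data=dt leaves the dependent variable itself among the candidate predictors, hence the "essentially perfect fit" warning. A sketch of the same forward search with both points changed, using swiss as a stand-in for dt and assuming syntactically valid column names:

```r
# Sketch only: `swiss` stands in for the asker's data frame `dt`.
dt <- swiss
threshold <- 0.70

selected <- list()
for (response in names(dt)) {
  predictors <- setdiff(names(dt), response)
  # Null model: intercept only. Full model: every *other* column.
  null_formula <- as.formula(paste(response, "~ 1"))
  full_formula <- as.formula(paste(response, "~", paste(predictors, collapse = " + ")))
  null_model <- lm(null_formula, data = dt)
  model <- step(null_model,
                scope = list(lower = null_formula, upper = full_formula),
                direction = "forward", trace = FALSE)
  adj_r2 <- summary(model)$adj.r.squared
  if (adj_r2 > threshold) {
    selected[[response]] <- formula(model)
    print(formula(model))
    cat("adjusted R^2:", round(adj_r2, 3), "\n\n")
  }
}
```

This only makes the loop do what was asked; the caveat above about stepwise selection without a proper question still applies.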