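I have a dataset called 'dt' in which all variables are numeric. I want to take each variable in turn as the dependent variable and use stepwise regression to find the best combination of the remaining predictors. If the resulting "best combination" gives an adjusted R^2 > 0.70, print it to the console. Here is my naive attempt: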
for(i in ncol(dt)){
  nul <- lm(dt[,i] ~ 1, data = dt)
  ful <- lm(dt[,i] ~ ., data = dt)
  model <- step(nul, scope = list(lower = nul, upper = ful), direction = "forward", trace = FALSE)
  if((summary(lm(as.formula(model$call), data = dt)))$adj.r.squared > 0.70){
    print(as.formula(model$call))
    cat(paste("\n"))
  }
}
This is the undesirable output I get:
dt[, i] ~ Y
Warning messages:
1: attempting model selection on an essentially perfect fit is nonsense
2: In summary.lm(lm(as.formula(model$call), data = dt)) :
essentially perfect fit: summary may be unreliable
Answer 0 (score: 1)
As @42- rightly pointed out, what you will get is statistical "garbage".
But if you insist on "testing" anyway, leaps::regsubsets makes it easy to get the R^2 of many linear models at once.
library(leaps)
a <- regsubsets(as.matrix(x=swiss[,-1]),y=swiss[,1], nvmax=1, nbest=100, intercept=F, method="exhaustive", really.big=T)
summary(a)
Subset selection object
5 Variables
                 Forced in Forced out
Agriculture          FALSE      FALSE
Examination          FALSE      FALSE
Education            FALSE      FALSE
Catholic             FALSE      FALSE
Infant.Mortality     FALSE      FALSE
100 subsets of each size up to 1
Selection Algorithm: exhaustive
         Agriculture Examination Education Catholic Infant.Mortality
1 ( 1 )  " "         " "         " "       " "      "*"
1 ( 2 )  "*"         " "         " "       " "      " "
1 ( 3 )  " "         "*"         " "       " "      " "
1 ( 4 )  " "         " "         " "       "*"      " "
1 ( 5 )  " "         " "         "*"       " "      " "
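In the example above, five lm models are fit with 'Fertility' as the dependent variable and each of the remaining variables as the single predictor, e.g. Fertility ~ Infant.Mortality, Fertility ~ Agriculture, and so on. Because of intercept=F, none of the models includes an intercept.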
summary(a)$rsq # returns R^2 for each of the five models
[1] 0.9703145 0.8558076 0.7054873 0.5660736 0.4474043
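The values above look surprisingly high, and that appears to be a side effect of intercept=F: without an intercept, R^2 is computed against zero rather than against the mean of the response (an "uncentered" R^2). A minimal base-R sketch to check this for Fertility ~ Infant.Mortality; the helper name `uncentered_r2` is made up for illustration and is not part of leaps:

```r
# Uncentered R^2 of a no-intercept, single-predictor fit, base R only.
# `uncentered_r2` is a hypothetical helper, not a leaps function.
uncentered_r2 <- function(y, x) sum(x * y)^2 / (sum(x^2) * sum(y^2))

y <- swiss$Fertility
x <- swiss$Infant.Mortality

r2_no_intercept   <- summary(lm(y ~ x - 1))$r.squared  # fit through the origin
r2_with_intercept <- summary(lm(y ~ x))$r.squared      # usual (centered) R^2
r2_by_hand        <- uncentered_r2(y, x)

print(c(no_intercept   = r2_no_intercept,
        by_hand        = r2_by_hand,
        with_intercept = r2_with_intercept))
```

The no-intercept value should agree with the first entry of summary(a)$rsq, while the fit with an intercept gives a much smaller R^2. This is one more reason not to read much into these numbers.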
Wrapping the regsubsets call in a function, say:
nonsense_lm <- function(data, x) regsubsets(as.matrix(x=data[,-x]),y=data[,x], nvmax=1, nbest=100, intercept=F, method="exhaustive", really.big=T)
you can then loop over the columns, using each variable in turn as the response:
nonsense <- lapply(1:ncol(swiss), function(x) nonsense_lm(swiss, x))
lapply(nonsense, function(x)summary(x)$rsq)
[[1]]
[1] 0.9703145 0.8558076 0.7054873 0.5660736 0.4474043
[[2]]
[1] 0.8558076 0.8121654 0.5785572 0.4961365 0.2715248
[[3]]
[1] 0.7844437 0.7729180 0.7054873 0.4961365 0.2132834
[[4]]
[1] 0.7729180 0.5456765 0.4474043 0.2715248 0.2137402
[[5]]
[1] 0.5785572 0.5660736 0.5135628 0.2137402 0.2132834
[[6]]
[1] 0.9703145 0.8121654 0.7844437 0.5456765 0.5135628
Again, note that these R^2 values really are statistical "garbage". Having a proper question to test is the most crucial step of any analysis.
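For completeness, here is a rough sketch of the loop the question seems to be after, with the same caveat about its statistical value. Two things in the original attempt look like the cause of the "perfect fit" warning: `for(i in ncol(dt))` runs only once (for the last column), and `dt[, i] ~ .` leaves the response among the predictors. The sketch below assumes `swiss` as a stand-in for `dt`, builds the formulas by column name with `reformulate`, and relies on the default AIC criterion of `step`; the names `threshold` and `selected` are illustrative.

```r
# Forward stepwise selection with each column in turn as the response.
# Assumption: `swiss` stands in for the asker's all-numeric data frame `dt`.
dt <- swiss
threshold <- 0.70      # adjusted R^2 cut-off taken from the question
selected <- list()     # formulas of the models that pass the cut-off

for (resp in names(dt)) {
  predictors <- setdiff(names(dt), resp)              # exclude the response
  null_model <- lm(reformulate("1", response = resp), data = dt)
  upper <- reformulate(predictors)                    # ~ x1 + x2 + ...
  model <- step(null_model,
                scope = list(lower = ~1, upper = upper),
                direction = "forward", trace = 0)     # AIC-based by default
  adj_r2 <- summary(model)$adj.r.squared
  if (adj_r2 > threshold) {
    selected[[resp]] <- formula(model)
    print(formula(model))
    cat("adjusted R^2:", round(adj_r2, 3), "\n\n")
  }
}
```

Using column names instead of `dt[, i]` keeps the printed formulas readable and lets `step` refit the models from `data = dt` without the response leaking into the predictor set.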