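I'm trying to write my own R function, similar to forward stepwise selection with step, but instead of using AIC as the selection criterion, I need to evaluate some criteria each time a predictor variable is added. The model is built as follows. It should start with the predictor variable that has the highest correlation with the dependent variable. Then, at each step, another predictor is added depending on whether the new model meets the following criteria.
This process is repeated until no remaining variable satisfies all three criteria. The output I need is just the names of all predictors in the final model, the corresponding coefficients, and the r2 value of the final model.
My example data (y is the dependent variable, x1 - x6 are the predictors)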
data = structure(list(y = c(23.6, 19.9, 40.7, 40.7, 40.7, 40.7, 40.2,
41.7, 41.7, 28.8), x1 = c(0.1, 0, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.1), x2 = c(0, 0.1, 0, 0, 0,
0, 0, 0.1, 0.1, 0), x3 = c(2277.6, 3038.1, 7797.9, 7797.9,
7797.9, 7797.9, 8392.2, 10127.2, 10127.2, 1799), x4 = c(34228.7,
49815, 76917.1, 76917.1, 76917.1, 76917.1, 75981.4, 74881.1,
74881.1, 56798.2), x5 = c(108786.5, 150465.5, 230397.1, 230397.1,
230397.1, 230397.1, 239300.9, 238493.8, 238493.8, 188799.5),
x6 = c(362.2, 198.2, 656.6, 656.6, 656.6, 656.6, 681,
655.3, 655.3, 222.3)), .Names = c("y", "x1",
"x2", "x3", "x4", "x5", "x6"), row.names = c(NA,
10L), class = "data.frame")
My first attempt at a model selection function
modSel = function(data, var){
  cor.result = cor(data[, var], data["y"]) # correlation coefficient of each variable against y
  max.cor = rownames(cor.result)[which.max(cor.result)] # variable with the highest correlation
  start.model = lm(as.formula(paste("y", max.cor, sep = "~")), data)
  # if (my criteria are met) ... else ... ??
  # this is the part I don't know how to write
  start.model
}
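Without much of a programming background, I really don't know how to evaluate my criteria repeatedly for an unknown number of iterations. I realize this may take a fair amount of code, but as a beginner I would really appreciate guidance on what the overall framework should look like.
Cheers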
Answer 0 (score: 0)
Hope this helps you get started.
Function that runs the algorithm
modSel <- function(data) {
  # initial model: start with the predictor most correlated with y
  cor.result <- cor(data$y, data[, -which(colnames(data) == "y")])
  vars.model <- colnames(cor.result)[which.max(cor.result)]
  vars.remaining <- colnames(data)[!colnames(data) %in% c("y", vars.model)]
  start.model <- lm(as.formula(paste("y", vars.model, sep = "~")), data)
  adj.rsq <- summary(start.model)$adj.r.squared
  # algorithm: try each remaining variable once, in column order
  for (var in vars.remaining) {
    # candidate model with var added
    vars.test <- paste(vars.model, var, sep = "+")
    fit <- lm(as.formula(paste("y", vars.test, sep = "~")), data)
    new.rsq <- summary(fit)$adj.r.squared
    # adjusted R-squared must improve by more than 0.01
    cond1 <- new.rsq > adj.rsq + .01
    # coefficient of the new variable must be positive
    cond2 <- coefficients(fit)[var] > 0
    # new variable must be significant at the 5% level
    cf <- summary(fit)$coefficients[, 4]
    cond3 <- cf[var] < .05
    # isTRUE() treats an NA result (e.g. a collinear variable) as not satisfied
    if (isTRUE(cond1 && cond2 && cond3)) {
      vars.model <- vars.test
      adj.rsq <- new.rsq
    }
  }
  lm(as.formula(paste("y", vars.model, sep = "~")), data)
}
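Note that the function above makes a single pass over the remaining variables, whereas the question asks to repeat until no remaining variable satisfies all criteria. A minimal sketch of that repetition follows, assuming the same three checks that appear in the code above (adjusted R-squared gain above 0.01, positive coefficient, p-value below 0.05). The name modSelRepeat and its parameters response, min.gain, and alpha are hypothetical and not part of the original answer.

```r
# Hypothetical variant: keep sweeping the remaining variables until a full
# sweep adds nothing. Thresholds default to the values used in modSel.
modSelRepeat <- function(data, response = "y", min.gain = 0.01, alpha = 0.05) {
  predictors <- setdiff(colnames(data), response)
  # start with the predictor most correlated with the response
  cors <- sapply(predictors, function(v) cor(data[[v]], data[[response]]))
  selected <- predictors[which.max(cors)]
  remaining <- setdiff(predictors, selected)
  fit <- lm(reformulate(selected, response = response), data = data)
  adj.rsq <- summary(fit)$adj.r.squared

  repeat {
    added <- FALSE
    for (v in remaining) {
      cand <- lm(reformulate(c(selected, v), response = response), data = data)
      coefs <- summary(cand)$coefficients
      # an aliased (perfectly collinear) term is absent from the table
      if (!v %in% rownames(coefs)) next
      ok <- summary(cand)$adj.r.squared > adj.rsq + min.gain &&
        coefs[v, "Estimate"] > 0 &&
        coefs[v, "Pr(>|t|)"] < alpha
      if (isTRUE(ok)) {
        selected <- c(selected, v)
        remaining <- setdiff(remaining, v)
        fit <- cand
        adj.rsq <- summary(cand)$adj.r.squared
        added <- TRUE
      }
    }
    # stop once a full sweep added nothing or nothing is left to try
    if (!added || length(remaining) == 0) break
  }
  fit
}
```

A variable rejected in one sweep can still be accepted in a later one, because the checks are re-evaluated against the current model each time. Whether that matches the intended criteria is a judgment call for the asker.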
Call modSel
It returns the best model found by the algorithm
bestfit <- modSel(data)
Summarize the model
summary(bestfit)
Call:
lm(formula = as.formula(paste("y", vars.model, sep = "~")), data = data)
Residuals:
Min 1Q Median 3Q Max
-0.8731 -0.3838 -0.3838 0.6273 1.5640
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.331e+01 1.941e+00 6.856 0.000241 ***
x1 5.417e+01 5.834e+00 9.285 3.48e-05 ***
x4 1.498e-04 4.460e-05 3.359 0.012099 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9345 on 7 degrees of freedom
Multiple R-squared: 0.9904, Adjusted R-squared: 0.9876
F-statistic: 360.5 on 2 and 7 DF, p-value: 8.721e-08
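If only the predictor names, coefficients, and R-squared are needed rather than the full summary, they can be pulled out of the fitted object. The sketch below is illustrative: modelReport is a hypothetical helper, and the model is fitted on the built-in mtcars data only so the block runs on its own. In practice the object returned by modSel(data) would be passed in instead.

```r
# Illustrative model on built-in data; replace with bestfit from modSel(data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: collect the pieces the question asks for.
modelReport <- function(fit) {
  s <- summary(fit)
  list(
    predictors    = attr(terms(fit), "term.labels"), # names of the predictors
    coefficients  = coef(fit),                       # intercept and slopes
    r.squared     = s$r.squared,
    adj.r.squared = s$adj.r.squared
  )
}

report <- modelReport(fit)
print(report)
```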