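I have a dataset named cost with about 400 observations, a dependent variable Y, and 10 independent variables. Two of them, PB1 and PB2, must be in the model. Of the remaining eight, I want to select exactly one of V0, V1, V2 and V3, and exactly one of L0, L1, L2 and L3, whichever fits the model best. I also want to cross-validate the model. I tried forward stepwise regression with CV, but the selected model can contain both V0 and V1 (or L1 and L3, etc.) at the same time, which defeats my purpose. Any suggestions on how to code this in R?
My code so far is as follows: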
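library(leaps)  # provides regsubsets()

set.seed(6)
K = 10    # number of CV folds
nv = 10   # largest model size to evaluate
folds = sample(rep(1:K, length = nrow(cost)))
cv.errors = matrix(NA, K, nv)  # rows: held-out fold, columns: model size
for (k in 1:K) {
  # forward stepwise selection on every fold except k
  best.fit = regsubsets(Y ~ ., data = cost[folds != k, ], nvmax = nv, method = "forward")
  for (i in 1:nv) {
    # relies on the predict.regsubsets() helper, which must already be defined
    pred = predict(best.fit, cost[folds == k, ], id = i)
    cv.errors[k, i] = mean((cost$Y[folds == k] - pred)^2)
  }
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))
plot(rmse.cv, pch = 19, type = "b")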
regsubsets has no predict method of its own, so predict.regsubsets is defined as a function beforehand, for example:
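predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])  # formula from the original regsubsets() call
  mat = model.matrix(form, newdata)    # design matrix for the new data
  coefi = coef(object, id = id)        # coefficients of the model with id variables
  mat[, names(coefi)] %*% coefi        # predictions: selected columns times coefficients
}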
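One possible way to enforce the "exactly one V and exactly one L" constraint is to skip stepwise search altogether: only 4 x 4 = 16 models satisfy it, so each can be fitted with lm and cross-validated directly. The following is a minimal sketch under stated assumptions, not a definitive solution. The simulated cost data frame is a hypothetical stand-in for the real data (its coefficients are made up), and the names candidates, best and final.fit are introduced here purely for illustration.

```r
# Simulated stand-in for the real `cost` data (assumption: same column names).
set.seed(6)
n <- 400
cost <- as.data.frame(matrix(rnorm(n * 10), n, 10))
names(cost) <- c("PB1", "PB2", paste0("V", 0:3), paste0("L", 0:3))
# Made-up data-generating process, used only so the sketch runs end to end.
cost$Y <- 1 + 2 * cost$PB1 - cost$PB2 + 1.5 * cost$V2 + 0.8 * cost$L1 + rnorm(n)

# Every allowed model: PB1 + PB2 + exactly one V + exactly one L (16 candidates).
candidates <- expand.grid(V = paste0("V", 0:3), L = paste0("L", 0:3),
                          stringsAsFactors = FALSE)
candidates$formula <- paste("Y ~ PB1 + PB2 +", candidates$V, "+", candidates$L)

# Randomly assign each row to one of K folds.
K <- 10
folds <- sample(rep(1:K, length.out = nrow(cost)))

# cv.errors[k, j] = test MSE of candidate j when fold k is held out.
cv.errors <- matrix(NA_real_, K, nrow(candidates))
for (k in 1:K) {
  train <- cost[folds != k, ]
  test  <- cost[folds == k, ]
  for (j in seq_len(nrow(candidates))) {
    fit  <- lm(as.formula(candidates$formula[j]), data = train)
    pred <- predict(fit, newdata = test)
    cv.errors[k, j] <- mean((test$Y - pred)^2)
  }
}

# Cross-validated RMSE per candidate; keep the candidate with the smallest one.
candidates$rmse.cv <- sqrt(colMeans(cv.errors))
best <- candidates[which.min(candidates$rmse.cv), ]
print(best)

# Refit the selected model on all rows.
final.fit <- lm(as.formula(best$formula), data = cost)
```

Because PB1 and PB2 appear in every candidate formula, they are always in the model, and no candidate can contain two V or two L variables. Note that the minimum CV RMSE over 16 candidates is a slightly optimistic estimate of out-of-sample error, since the same folds were used for the selection.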