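I am trying to run repeated 10-fold CV over `alpha`/`lambda` (with `glmnet` and `glmnetUtils`). My proposed workflow is to:
a) fit the proposed model using 11 values of `alpha`,
b) run the process X times (in this case, 10),
c) average the results, and
d) fit the final model using the optimal combination of `alpha` and `lambda` (`s = "lambda.1se"`).
To address a-c, I used the code below; however, the results are exactly the same for all 10 iterations.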
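library(glmnet)
library(glmnetUtils)
library(doParallel)
library(plyr)   # ldply()
library(dplyr)  # bind_rows()
data(BinomialExample)
# Note: in recent glmnet releases this loads a list rather than x and y directly;
# if so, use x <- BinomialExample$x and y <- BinomialExample$y
# Create alpha sequence (11 values); fix folds
alpha <- seq(.5, 1, .05)
set.seed(1)
folds <- sample(1:10, size = length(y), replace = TRUE)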
# Determine optimal combination of alpha and lambda; extract lowest CV error and associated lambda at each alpha
extractGlmnetInfo <- function(object)
{
  # Find lambdas
  lambda1se <- object$lambda.1se
  # Determine where lambdas fall in path
  which1se <- which(object$lambda == lambda1se)
  # Create data frame with selected lambdas and corresponding error
  data.frame(lambda.1se = lambda1se, cv.1se = object$cvm[which1se])
}
# Run glmnet
cl <- makeCluster(detectCores())
registerDoParallel(cl)
enet <- foreach(i = 1:10,
                .inorder = FALSE,
                .multicombine = TRUE,
                .packages = "glmnetUtils") %dopar%
{
  cv <- cva.glmnet(x, y,
                   foldid = folds,
                   alpha = alpha,
                   family = "binomial",
                   parallel = TRUE)
}
stopCluster(cl)
# Extract smallest CV error and lambda at each alpha for each iteration of 10-fold CV
# Calculate means (across iterations) of lowest CV error and associated lambdas for each alpha
# One data frame per iteration (one row per alpha), stacked in iteration order
cv.rep <- bind_rows(lapply(enet, function(fit) ldply(fit$modlist, extractGlmnetInfo)))
cv.rep <- data.frame(cbind(alpha, cv.rep))
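For step c, a minimal sketch of the averaging might look like the following. It assumes `cv.rep` has the columns `alpha`, `lambda.1se`, and `cv.1se` as built above; to keep it runnable on its own, the data frame here is a synthetic stand-in filled with random numbers (not real CV output), and `n_rep`, `cv.mean`, and `best` are illustrative names.

```r
# Synthetic stand-in for cv.rep: 10 repetitions x 11 alpha values.
# The numbers are random placeholders, NOT real cross-validation results.
set.seed(42)
alpha <- seq(0.5, 1, 0.05)
n_rep <- 10
cv.rep <- data.frame(
  alpha      = rep(alpha, times = n_rep),
  lambda.1se = runif(length(alpha) * n_rep, min = 0.01, max = 0.1),
  cv.1se     = runif(length(alpha) * n_rep, min = 0.8, max = 1.2)
)

# Step c: average lambda.1se and its CV error across repetitions, per alpha.
cv.mean <- aggregate(cbind(lambda.1se, cv.1se) ~ alpha, data = cv.rep, FUN = mean)

# One possible selection rule for step d: the alpha with the lowest mean CV error.
best <- cv.mean[which.min(cv.mean$cv.1se), ]
```

With real results, `cv.mean` would have one row per `alpha`, and `best` would hold the candidate `alpha` together with its averaged `lambda.1se`.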
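Questions
1. It is my understanding that the folds should be held fixed when cross-validating over `alpha`. Should I therefore call `set.seed()` multiple times to generate different `folds` for each iteration and run each iteration separately, rather than looping through them? For example: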
# Set folds for first iteration
set.seed(1)
folds1 <- sample(1:10, size = length(y), replace = TRUE)
# Run first iteration
enet1 <- cva.glmnet(x, y,
                    foldid = folds1,
                    alpha = alpha,
                    family = "binomial")
# Set folds for second iteration
set.seed(2)
folds2 <- sample(1:10, size = length(y), replace = TRUE)
# Run second iteration
enet2 <- cva.glmnet(x, y,
                    foldid = folds2,
                    alpha = alpha,
                    family = "binomial")
Or is there a way to fix `folds` and loop through the iterations, thereby taking advantage of parallel processing?
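One way this might be set up is to precompute a list of fold vectors, one per repetition, so that each vector stays fixed across all values of `alpha` while differing between repetitions. The sketch below uses only base R; `n` is a stand-in for `length(y)`, and `n_rep`, `k`, and `fold_list` are hypothetical names introduced for illustration.

```r
# Assumptions: n stands in for length(y); n_rep repetitions of k-fold CV.
n     <- 100
n_rep <- 10
k     <- 10

# One fold vector per repetition, each drawn under its own seed so that
# repetitions differ from one another but remain reproducible.
fold_list <- lapply(seq_len(n_rep), function(i) {
  set.seed(i)
  sample(1:k, size = n, replace = TRUE)
})
```

Inside the `foreach` body, `fold_list[[i]]` could then be passed as `foldid` in place of the single `folds` vector, so that every `alpha` within repetition `i` sees the same folds.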
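2. Re: the option proposed in 1., how does one determine which configuration of `folds` to use to fit the final model with the optimal combination of `alpha` and `lambda`? Is this decision arbitrary?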
N.B. I am not using […] for this particular task.
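For reference, one possible reading of step d is sketched below. Since `glmnet()` itself takes no `foldid` argument, a final fit on the full data, evaluated at an averaged `lambda`, would not depend on any particular fold configuration. The data are simulated stand-ins, and `best_alpha` and `best_lambda` are hypothetical placeholder values, not results from the code above.

```r
library(glmnet)

# Simulated stand-in data (assumption; replace with the real x and y).
set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
y <- rbinom(100, size = 1, prob = plogis(x[, 1] - x[, 2]))

# Hypothetical placeholders for the averaged repeated-CV results.
best_alpha  <- 0.75
best_lambda <- 0.05

# Fit the regularization path at the chosen alpha on all of the data.
final_fit <- glmnet(x, y, family = "binomial", alpha = best_alpha)

# Coefficients at the averaged lambda; no fold assignment enters this step.
final_coef <- coef(final_fit, s = best_lambda)
```

Whether averaging `lambda.1se` across repetitions is the right summary is itself a judgment call; the sketch only shows the mechanics.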