Question

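I am trying to run repeated 10-fold CV over alpha/lambda using glmnet and glmnetUtils. My proposed workflow is to:

a) fit the proposed model using 11 values of alpha,

b) run the process X times (in this case, 10),

c) average the results, and

d) fit the final model using the best combination of alpha and lambda (s = "lambda.1se").

To address a) through c), I used the code below; however, the results are exactly the same across all 10 iterations.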

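library(glmnet)
library(glmnetUtils)
library(doParallel)
library(plyr)   # ldply()
library(dplyr)  # bind_rows()

data(BinomialExample)
# Recent glmnet releases load a list named BinomialExample; older ones load x and y directly
if (exists("BinomialExample")) { x <- BinomialExample$x; y <- BinomialExample$y }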


# Create alpha sequence; fix folds

alpha <- seq(.5, 1, .05)

set.seed(1)
folds <- sample(1:10, size = length(y), replace = TRUE)


# Determine optimal combination of alpha and lambda; extract lowest CV error and associated lambda at each alpha

extractGlmnetInfo <- function(object)
{
  # Find lambdas
  lambda1se <- object$lambda.1se

  # Determine where lambdas fall in path
  which1se <- which(object$lambda == lambda1se)

  # Create data frame with selected lambdas and corresponding error
  data.frame(lambda.1se = lambda1se, cv.1se = object$cvm[which1se])
}


#Run glmnet

cl <- makeCluster(detectCores())
registerDoParallel(cl)

enet <- foreach(i = 1:10,
                .inorder = FALSE,
                .multicombine = TRUE,
                .packages = "glmnetUtils") %dopar%
  {
    cv <- cva.glmnet(x, y,
                     foldid = folds,
                     alpha = alpha,
                     family = "binomial",
                     parallel = TRUE)
    }

stopCluster(cl)


# Extract smallest CV error and lambda at each alpha for each iteration of 10-fold CV
# Calculate means (across iterations) of lowest CV error and associated lambdas for each alpha

# Apply extractGlmnetInfo to every alpha within each iteration, then stack the iterations
cv.rep <- bind_rows(lapply(enet, function(fit) ldply(fit$modlist, extractGlmnetInfo)))

cv.rep <- data.frame(cbind(alpha, cv.rep))
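
The comment above mentions averaging across iterations, but the code stops at stacking the rows. A minimal sketch of that averaging step in base R follows, assuming `cv.rep` has the columns `alpha`, `lambda.1se` and `cv.1se` as built above. The data frame `cv.rep.toy` and its values are synthetic placeholders so the snippet runs on its own, and `cv.mean` and `best` are illustrative names. Taking the arithmetic mean of the lambdas is one possible choice, not the only one.

```r
# Synthetic stand-in for cv.rep: 10 repetitions x 11 alphas (values are made up)
set.seed(1)
alpha <- seq(.5, 1, .05)
cv.rep.toy <- data.frame(alpha      = rep(alpha, times = 10),
                         lambda.1se = runif(110, 0.01, 0.1),
                         cv.1se     = runif(110, 0.8, 1.2))

# Mean lambda.1se and mean CV error per alpha, across repetitions
cv.mean <- aggregate(cbind(lambda.1se, cv.1se) ~ alpha,
                     data = cv.rep.toy, FUN = mean)

# Row (alpha, averaged lambda) with the lowest mean CV error
best <- cv.mean[which.min(cv.mean$cv.1se), ]
```

With the real `cv.rep`, the same `aggregate()` call would apply after replacing `cv.rep.toy`.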

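Questions

My understanding is that the folds should be fixed when cross-validating over alpha. So, should I call set.seed() several times to generate different folds for each iteration and run each iteration separately, rather than looping over them? For example: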

# Set folds for first iteration

set.seed(1)
folds1 <- sample(1:10, size = length(y), replace = TRUE)


# Run first iteration

enet1 <- cva.glmnet(x, y,
                foldid = folds1,
                alpha = alpha,
                family = "binomial")


# Set folds for second iteration

set.seed(2)
folds2 <- sample(1:10, size = length(y), replace = TRUE)


# Run second iteration

enet2 <- cva.glmnet(x, y,
                foldid = folds2,
                alpha = alpha,
                family = "binomial")
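
The per-iteration fold vectors above could also be generated in one pass rather than by hand. This is a base R sketch under assumptions: `n` stands in for `length(y)`, and `nreps`, `nfolds` and `foldlist` are illustrative names. It also swaps `sample(1:10, ..., replace = TRUE)` for a shuffled `rep()`, which gives equally sized folds; that is a deliberate change from the code above, not a requirement.

```r
n      <- 100   # stand-in for length(y)
nreps  <- 10    # number of CV repetitions
nfolds <- 10    # folds per repetition

# One fold assignment per repetition; within a repetition the same
# vector would be reused for every alpha
foldlist <- lapply(seq_len(nreps), function(i) {
  set.seed(i)                                   # reproducible, different per repetition
  sample(rep(seq_len(nfolds), length.out = n))  # balanced fold labels, shuffled
})
```

Each element, e.g. `foldlist[[3]]`, could then be passed as `foldid` for the corresponding repetition.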

Or is there a way to fix the folds and loop over the iterations, so as to take advantage of parallel processing?
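
One possible shape for such a loop, offered as a sketch rather than a confirmed solution: draw the fold vector inside the `foreach` body so that it differs between repetitions but stays fixed across alpha within each repetition. The worker count of 2 and the name `folds_i` are assumptions made for illustration, and the inner `parallel = TRUE` is dropped here because the outer loop is already parallel.

```r
library(glmnet)
library(glmnetUtils)
library(doParallel)

data(BinomialExample)
# Recent glmnet releases load a list; older ones load x and y directly
if (exists("BinomialExample")) { x <- BinomialExample$x; y <- BinomialExample$y }

alpha <- seq(.5, 1, .05)

cl <- makeCluster(2)   # small worker count, chosen only for illustration
registerDoParallel(cl)

enet <- foreach(i = 1:10, .packages = "glmnetUtils") %dopar% {
  set.seed(i)                                           # different folds per repetition
  folds_i <- sample(rep(1:10, length.out = length(y)))  # fixed across alpha within repetition i
  cva.glmnet(x, y, foldid = folds_i, alpha = alpha, family = "binomial")
}

stopCluster(cl)
```

Because `folds_i` is created per repetition, the ten elements of `enet` would no longer be built from identical fold assignments.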
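Regarding the first option above (a separate set.seed() and set of folds per iteration): how do I determine which configuration of folds to use when fitting the final model with the best combination of alpha and {{1}}? Is this decision arbitrary?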
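
One way to look at this, as a hedged sketch and not a definitive answer: `glmnet()` itself takes no fold argument, so a final fit on the full data does not depend on any particular fold configuration; the folds only enter the tuning step. In the snippet below, `best_alpha` and `best_lambda` are hypothetical placeholder values standing in for whatever the averaged repeated-CV results suggest.

```r
library(glmnet)

data(BinomialExample)
# Recent glmnet releases load a list; older ones load x and y directly
if (exists("BinomialExample")) { x <- BinomialExample$x; y <- BinomialExample$y }

# Hypothetical values standing in for the averaged repeated-CV results
best_alpha  <- 0.75
best_lambda <- 0.03

# Final fit on all observations; no fold assignment is involved at this stage
final <- glmnet(x, y, alpha = best_alpha, family = "binomial")

# Coefficients at the chosen penalty (interpolated if best_lambda is not on the path)
coefs <- coef(final, s = best_lambda)
```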

N.B. I have not used lambda for this particular task.

glmnet / glmnetUtils: repeated cross-validation

0 answers: