glmnet / glmnetUtils: Repeated cross-validation

Time: 2018-07-19 09:30:32

Tags: r glmnet

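I am trying to run repeated 10-fold CV over alpha/lambda using glmnet/glmnetUtils. My proposed workflow is to:

a) fit the proposed model using 11 values of alpha,

b) run the process X times (10 in this case),

c) average the results, and

d) fit the final model using the optimal combination of alpha and lambda (s = "lambda.1se").

To address a) through c), I used the code below; however, the results are exactly the same across all 10 iterations.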

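library(glmnet)
library(glmnetUtils)
library(doParallel)
library(plyr)   # ldply()
library(dplyr)  # bind_rows()

# Provides the predictor matrix x and the binary response y used below
# (newer glmnet releases may instead load a list with elements $x and $y)
data(BinomialExample)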


# Create alpha sequence; fix folds

alpha <- seq(.5, 1, .05)

set.seed(1)
folds <- sample(1:10, size = length(y), replace = TRUE)


# Determine optimal combination of alpha and lambda; extract lowest CV error and associated lambda at each alpha

extractGlmnetInfo <- function(object)
{
  # Find lambdas
  lambda1se <- object$lambda.1se

  # Determine where lambdas fall in path
  which1se <- which(object$lambda == lambda1se)

  # Create data frame with selected lambdas and corresponding error
  data.frame(lambda.1se = lambda1se, cv.1se = object$cvm[which1se])
}


# Run glmnet

cl <- makeCluster(detectCores())
registerDoParallel(cl)

enet <- foreach(i = 1:10,
                .inorder = FALSE,
                .multicombine = TRUE,
                .packages = "glmnetUtils") %dopar%
  {
    cva.glmnet(x, y,
               foldid = folds,
               alpha = alpha,
               family = "binomial",
               parallel = TRUE)
  }

stopCluster(cl)


# Extract smallest CV error and lambda at each alpha for each iteration of 10-fold CV
# Calculate means (across iterations) of lowest CV error and associated lambdas for each alpha

# One data frame per iteration (one row per alpha), stacked into a single data frame
cv.rep <- dplyr::bind_rows(lapply(enet, function(fit) plyr::ldply(fit$modlist, extractGlmnetInfo)))

# Attach alpha; the 11 values are recycled across the 10 iterations
cv.rep <- data.frame(alpha = alpha, cv.rep)
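
For step c), the averaging across iterations could be sketched roughly as follows. This is only an illustration: the `cv.rep` built here is a hypothetical stand-in filled with random placeholder numbers (not real CV output), and the names `cv.mean` and `best` are assumptions introduced for the example.

```r
# Hypothetical stand-in for cv.rep: 10 repetitions x 11 alpha values,
# filled with random placeholder numbers rather than real CV results
alpha <- seq(.5, 1, .05)
set.seed(1)
cv.rep <- data.frame(alpha      = rep(alpha, times = 10),
                     lambda.1se = runif(110, 0.01, 0.10),
                     cv.1se     = runif(110, 0.80, 1.20))

# Mean lambda.1se and mean CV error for each alpha, across repetitions
cv.mean <- aggregate(cbind(lambda.1se, cv.1se) ~ alpha,
                     data = cv.rep, FUN = mean)

# Row with the lowest mean CV error: candidate alpha and its averaged lambda
best <- cv.mean[which.min(cv.mean$cv.1se), ]
```

Whether averaging lambda.1se directly is the right summary is a judgment call; the sketch only shows the mechanics of grouping by alpha.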

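Questions

  1. My understanding is that the folds should be held fixed when cross-validating over alpha. Should I therefore call set.seed() several times to generate a different folds vector for each iteration and run each iteration separately, rather than looping over them? For example: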

    # Set folds for first iteration
    
    set.seed(1)
    folds1 <- sample(1:10, size = length(y), replace = TRUE)
    
    
    # Run first iteration
    
    enet1 <- cva.glmnet(x, y,
                    foldid = folds1,
                    alpha = alpha,
                    family = "binomial")
    
    
    # Set folds for second iteration
    
    set.seed(2)
    folds2 <- sample(1:10, size = length(y), replace = TRUE)
    
    
    # Run second iteration
    
    enet2 <- cva.glmnet(x, y,
                    foldid = folds2,
                    alpha = alpha,
                    family = "binomial")
    
  2. Or is there a way to fix folds and still loop over the iterations, so that parallel processing can be used?

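  3. Regarding the option proposed in 1.: how does one decide which configuration of folds to use when fitting the final model with the optimal combination of alpha and lambda? Is this choice arbitrary?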
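
To make question 2 concrete, here is a minimal sketch of one possible arrangement: a separate fold assignment is drawn for each repetition up front, each assignment stays fixed across all alpha values within its repetition, and the repetitions are spread over workers. The names `n.rep` and `fold.list`, the two-worker cluster, and the balanced fold assignment via `rep_len()` are assumptions made for illustration, as is the handling of both possible layouts of `BinomialExample`.

```r
library(glmnet)
library(glmnetUtils)
library(doParallel)

data(BinomialExample)
# Assumption: newer glmnet releases load a list named BinomialExample,
# older ones load x and y directly; cover both cases
if (exists("BinomialExample")) {
  x <- BinomialExample$x
  y <- BinomialExample$y
}

alpha <- seq(.5, 1, .05)
n.rep <- 10  # number of CV repetitions (hypothetical name)

# One fold vector per repetition, all drawn from a single seed
set.seed(1)
fold.list <- lapply(seq_len(n.rep),
                    function(i) sample(rep_len(1:10, length(y))))

cl <- makeCluster(2)  # worker count is an arbitrary choice here
registerDoParallel(cl)

# Each repetition uses its own folds, shared by every alpha in that repetition
enet <- foreach(i = seq_len(n.rep), .packages = "glmnetUtils") %dopar% {
  cva.glmnet(x, y,
             foldid = fold.list[[i]],
             alpha = alpha,
             family = "binomial")
}

stopCluster(cl)
```

Because the fold vectors are generated before the loop, the result does not depend on how the workers schedule the iterations.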

N.B. I am not using lambda for this particular task.

0 answers:

No answers