What happens if the wrong index structure is used with trainControl repeatedcv?

Asked: 2019-08-24 14:59:46

Tags: r cross-validation caret

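I made a mistake and mixed up the code for resampling via Monte Carlo cross-validation (LGOCV) with the code for repeated cross-validation (repeatedcv). It seems to work somehow, but I would still like to understand how the resampling and validation are actually carried out.

Dataset description: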

12,000 rows
with 100    rows having class = "is_class"
with 11,900 rows having class = "no_class"

Creating the index:

trainIndex <- createDataPartition(full_dataset$class, p = .8,
                                  list = FALSE,
                                  times = 1)
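
To make the shape of this index concrete, here is a minimal sketch. It assumes a synthetic label vector `y` with the class counts described above as a stand-in for `full_dataset$class`; `y` and `idx` are illustrative names, not part of the original code.

```r
library(caret)

set.seed(1)
# Assumption: synthetic labels standing in for full_dataset$class
y <- factor(c(rep("is_class", 100), rep("no_class", 11900)))

# Same call as above: list = FALSE returns a matrix, one column per partition
trainIndex <- createDataPartition(y, p = .8, list = FALSE, times = 1)

is.matrix(trainIndex)   # a matrix rather than a list
dim(trainIndex)         # rows = selected row IDs, columns = number of partitions

# Wrapping the matrix in list() yields a list with a single element
idx <- list(TrainSet = trainIndex)
length(idx)
```

Seen this way, `list(TrainSet = trainIndex)` looks like one set of training rows rather than many separate one-row sets, although how train() treats it should be confirmed on the fitted object.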

The LGOCV snippet for trainControl:

ctrl <- trainControl(method = "LGOCV",
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     index = list(TrainSet = trainIndex),
                     sampling = "up",
                     savePredictions = TRUE)

What was used instead (repeatedcv):

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     sampling = "up",
                     index = list(TrainSet = trainIndex),
                     savePredictions = TRUE)
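
For comparison, here is a sketch of an index whose structure matches 10-fold CV repeated 5 times, built with caret's createMultiFolds. The label vector `y` is a synthetic stand-in for `full_dataset$class`, and `folds` and `ctrl_cv` are illustrative names. This is one possible setup, not necessarily what was intended in the original code.

```r
library(caret)

set.seed(1)
# Assumption: synthetic labels standing in for full_dataset$class
y <- factor(c(rep("is_class", 100), rep("no_class", 11900)))

# One list element of training row IDs per fold and repeat
folds <- createMultiFolds(y, k = 10, times = 5)

length(folds)          # number of resamples in the list
head(names(folds), 3)  # element names identify fold and repeat
range(lengths(folds))  # training-set size per resample

# Passing the list as `index` gives train() one resample per element
ctrl_cv <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 5,
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE,
                        sampling = "up",
                        index = folds,
                        savePredictions = TRUE)
```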

How training was run:

plsFit <- train(x = full_dataset[,fullSet], 
                y = full_dataset$class,
                method = "pls",
                tuneGrid = expand.grid(ncomp = 1:10),
                preProc = c("center","scale"),
                metric = "ROC",
                probMethod = "Bayes",
                trControl = ctrl)

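Since TrainSet contains ~10,000 "folds", each with only one row ID, while the standard setting for repeatedcv is 10-fold CV, I am not sure what actually happens.

Is the 10-fold setting automatically overridden by a 10k-fold cross-validation setup?

Output of plsFit: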
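
One way to probe this is to count what the index list actually contains. The sketch below uses a synthetic label vector `y` as a stand-in for `full_dataset$class`. It relies on the documented behaviour in ?trainControl that, when `indexOut` is NULL, the held-out rows are those not contained in `index`. `n_resamples` and `holdout` are illustrative names.

```r
library(caret)

set.seed(1)
# Assumption: synthetic labels standing in for full_dataset$class
y <- factor(c(rep("is_class", 100), rep("no_class", 11900)))

trainIndex <- createDataPartition(y, p = .8, list = FALSE, times = 1)
index <- list(TrainSet = trainIndex)

# Each list element corresponds to one resample
n_resamples <- length(index)

# Rows not listed in the index element form the holdout for that resample
holdout <- setdiff(seq_along(y), unique(as.vector(index$TrainSet)))

n_resamples               # resamples implied by the index list
length(index$TrainSet)    # training rows in that resample
length(holdout)           # held-out rows in that resample
```

On a fitted model, inspecting `plsFit$control$index` and `plsFit$resample` should show how many resamples were really evaluated. Whether the printed "10 fold, repeated 5 times" label reflects the supplied index or just the `method` and `repeats` arguments is an assumption worth checking there.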

> 12000 samples   
>   650 predictor
>     2 classes: "is_class", "no_class"
> 
> Pre-Processing: centered(650), scaled(650) 
> Resampling: Cross-Validated (10 fold, repeated 5 times) 
> Summary of Sample sizes: 10000
> Additional sampling using up-sampling prior to pre-processing
> ....

0 answers