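I made a mistake and mixed up the code for resampling via Monte Carlo cross-validation (LGOCV) with the code for repeated cross-validation (repeatedcv). It seems to work somehow, but I am still curious how the resampling and validation are actually performed.
Dataset description: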
12,000 rows
with 100 rows having class = "is_class"
with 11,900 rows having class = "no_class"
Creating the index:
# one stratified 80/20 split, returned as a one-column matrix of row numbers
trainIndex <- createDataPartition(full_dataset$class, p = .8,
                                  list = FALSE,
                                  times = 1)
The LGOCV snippet for trainControl (what I meant to use):
ctrl <- trainControl(method = "LGOCV",
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     index = list(TrainSet = trainIndex),
                     sample = "up",
                     savePredictions = TRUE)
What I actually used instead (repeatedcv):
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     sample = "up",
                     index = list(TrainSet = trainIndex),
                     savePredictions = TRUE)
How the training was run:
plsFit <- train(x = full_dataset[, fullSet],
                y = full_dataset$class,
                method = "pls",
                tuneGrid = expand.grid(ncomp = 1:10),
                preProc = c("center", "scale"),
                metric = "ROC",
                probMethod = "Bayes",
                trControl = ctrl)
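Since TrainSet contains ~10,000 "folds" with only one row ID each, and the default setup for repeatedcv is 10-fold CV, I am not sure what actually happens.
It looks like the 10-fold setting is automatically overridden by a 10k-fold cross-validation setup, doesn't it?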
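One way to check this premise is to look at the shape of the object passed to index. As far as I understand the caret documentation, each element of that list is one resampling iteration, its contents are the training row numbers for that iteration, and when indexOut is not given the rows not listed are held out. Below is a minimal base-R sketch of that check. It does not call caret; train_index is a made-up stand-in for trainIndex, and the 9600 rows are an assumption (80% of 12,000), not a value taken from my real data.

```r
# Hypothetical stand-in for the one-column matrix that
# createDataPartition(..., list = FALSE, times = 1) returns.
set.seed(1)
n_rows <- 12000
train_index <- matrix(sort(sample(n_rows, 9600)), ncol = 1,
                      dimnames = list(NULL, "Resample1"))

# Same construction as in the trainControl() calls above.
index_list <- list(TrainSet = train_index)

# Number of list elements, i.e. number of resampling iterations.
n_resamples <- length(index_list)

# Number of training row IDs inside the single element.
n_train <- length(index_list$TrainSet)

# Rows not listed in the element; these would form the hold-out set
# if indexOut is left unset.
holdout <- setdiff(seq_len(n_rows), index_list$TrainSet)

cat("resamples:", n_resamples, "\n")
cat("training rows in TrainSet:", n_train, "\n")
cat("held-out rows:", length(holdout), "\n")
```

Running the same three checks on the real list(TrainSet = trainIndex) should show whether it describes many one-row folds or a single train/hold-out split.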
Output of plsFit:
> 12000 samples
> 650 predictor
> 2 classes: "is_class", "no_class"
>
> Pre-Processing: centered(650), scaled(650)
> Resampling: Cross-Validated (10 fold, repeated 5 times)
> Summary of Sample sizes: 10000
> Additional sampling using up-sampling prior to pre-processing
> ....