I am doing a lot of work with KNN. At the moment I train the model on the whole dataset and use the repeatedcv method to split the data into 4 folds in order to estimate accuracy. I would like to know two things.
First: is using repeatedcv with repeats = 100 in the train function equivalent to manually setting aside a quarter of the data as a test set with createDataPartition() and then scoring it with predict()? To me, the only difference is that in the first case the model's predictive ability is assessed and tuned over all of the data, while in the second it is tuned only on the training partition. So if I am only evaluating the influence of the predictors, rather than trying to apply the model to new data, using train on the whole dataset would be the better choice. Is that correct?
Second: is it possible to obtain the No Information Rate for the folds held out during training, in the same way that it is reported when evaluating a caret confusion matrix on a test set? (I have put a rough manual attempt right after the first confusion matrix output below.)
I have included the following code, using the iris dataset as an example:
library(caret)

set.seed(10)

# use train() with repeated cross-validation on the whole dataset to estimate accuracy
train_controlIris <- trainControl(method = "repeatedcv", number = 4, repeats = 100,
                                  returnResamp = 'final', savePredictions = 'final')
modelIris <- train(Species ~ ., data = iris, method = "knn", trControl = train_controlIris)
confMatrixRepeatCV <- confusionMatrix(modelIris)
print(confMatrixRepeatCV)
# this confusion matrix does not report a No Information Rate:
Cross-Validated (4 fold, repeated 100 times) Confusion Matrix

(entries are percentual average cell counts across resamples)

            Reference
Prediction   setosa versicolor virginica
  setosa       33.3        0.0       0.0
  versicolor    0.0       31.7       1.5
  virginica     0.0        1.6      31.9

 Accuracy (average) : 0.9689
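As far as I understand it, the No Information Rate is just the proportion of the largest class, so my rough manual attempt is to compute it from the held-out predictions that savePredictions = 'final' keeps in the model object (this is only my own sketch, not something I found in the caret documentation):

# out-of-fold predictions kept by savePredictions = 'final'
heldOut <- modelIris$pred

# rough manual NIR: proportion of the most frequent observed class among
# the held-out samples (for iris this is simply 1/3)
manualNIR <- max(prop.table(table(heldOut$obs)))
print(manualNIR)

I am not sure whether this corresponds to what confusionMatrix() reports as the No Information Rate on a test partition, which is essentially my second question.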
# now create a partition and use a model trained on it to predict the test partition
index <- createDataPartition(iris$Species, p = 0.75, list = FALSE)

# make train and test sets
data.train <- iris[index, ]
data.test  <- iris[-index, ]

model_knn <- train(data.train[, 1:4], data.train$Species, method = "knn")
predictions <- predict(object = model_knn, data.test[, 1:4])
evaluationCaret <- confusionMatrix(predictions, data.test$Species)
print(evaluationCaret)
# this confusion matrix does report a No Information Rate:
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         12          0         0
  versicolor      0         11         1
  virginica       0          1        11

Overall Statistics

               Accuracy : 0.9444
                 95% CI : (0.8134, 0.9932)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : 1.728e-14

                  Kappa : 0.9167
 Mcnemar's Test P-Value : NA
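For context on my first question, this is how I am lining up the two accuracy estimates from the objects created above (as far as I can tell, AccuracyNull is where confusionMatrix() stores the No Information Rate):

# resampled (out-of-fold) accuracy per tuning value of k, averaged over
# the 4-fold x 100-repeat cross-validation on the whole dataset
modelIris$results[, c("k", "Accuracy")]

# accuracy and No Information Rate on the single 25% hold-out partition
evaluationCaret$overall[c("Accuracy", "AccuracyNull")]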