I really like using caret, at least for the early stages of modeling, especially because it makes resampling methods so easy to use. However, I'm working on a model where the training set has had a fair number of cases added via semi-supervised self-training, and my cross-validation results are badly skewed because of it. My solution is to use a validation set to measure model performance, but I can't see a way to use a validation set directly within caret - am I missing something, or is this just not supported? I know I could write my own wrapper to do what caret would normally do for me, but it would be really nice if there were a workaround that avoided that.
Here's a simple example of what I've been running into:
> library(caret)
> set.seed(1)
>
> #training/validation sets
> i <- sample(150,50)
> train <- iris[-i,]
> valid <- iris[i,]
>
> #make my model
> tc <- trainControl(method="cv")
> model.rf <- train(Species ~ ., data=train,method="rf",trControl=tc)
>
> #model parameters are selected using CV results...
> model.rf
100 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Cross-Validation (10 fold)
Summary of sample sizes: 90, 90, 90, 89, 90, 92, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa Accuracy SD Kappa SD
2 0.971 0.956 0.0469 0.0717
3 0.971 0.956 0.0469 0.0717
4 0.971 0.956 0.0469 0.0717
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
>
> #have to manually check validation set
> valid.pred <- predict(model.rf,valid)
> table(valid.pred,valid$Species)
valid.pred setosa versicolor virginica
setosa 17 0 0
versicolor 0 20 1
virginica 0 2 10
> mean(valid.pred==valid$Species)
[1] 0.94
I originally thought I could accomplish this by creating a custom summaryFunction() for the trainControl() object, but I can't see how to reference my model object to get predictions from the validation set (the documentation only lists "data", "lev" and "model" as possible arguments). For example, this clearly won't work:
tc$summaryFunction <- function(data, lev = NULL, model = NULL){
data.frame(Accuracy=mean(predict(<model object>,valid)==valid$Species))
}
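(For what it's worth, a custom summaryFunction only ever sees the hold-out predictions that caret hands it as a data frame with obs and pred columns, plus the class levels and the method name as a character string, so the fitted model itself is never available there. Just as a sketch of what a valid summary function looks like - not something from the original post:)

#sketch: a summaryFunction can only use what caret passes in via `data`
#(obs, pred, and any class-probability columns); `model` is just the method name
accuracySummary <- function(data, lev = NULL, model = NULL){
  c(Accuracy = mean(data$pred == data$obs))
}
tc <- trainControl(method="cv", summaryFunction=accuracySummary)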
Edit: in an attempt to come up with a truly ugly fix, I've been looking at whether I can access the model object from the scope of another function, but I don't even see the models being stored anywhere. Hopefully there's some elegant solution that I'm not even close to seeing...
> tc$summaryFunction <- function(data, lev = NULL, model = NULL){
+ browser()
+ data.frame(Accuracy=mean(predict(model,valid)==valid$Species))
+ }
> train(Species ~ ., data=train,method="rf",trControl=tc)
note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
Called from: trControl$summaryFunction(testOutput, classLevels, method)
Browse[1]> lapply(sys.frames(),function(x) ls(envi=x))
[[1]]
[1] "x"
[[2]]
[1] "cons" "contrasts" "data" "form" "m" "na.action" "subset"
[8] "Terms" "w" "weights" "x" "xint" "y"
[[3]]
[1] "x"
[[4]]
[1] "classLevels" "funcCall" "maximize" "method" "metric" "modelInfo"
[7] "modelType" "paramCols" "ppMethods" "preProcess" "startTime" "testOutput"
[13] "trainData" "trainInfo" "trControl" "tuneGrid" "tuneLength" "weights"
[19] "x" "y"
[[5]]
[1] "data" "lev" "model"
Answer 0 (score: 6):
Take a look at trainControl. There are now options to directly specify which rows of the data are used to model the data (the index argument) and which rows should be used to compute the hold-out estimates (called indexOut). I think that's what you are looking for.
Max
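A minimal sketch of how that might be used for the example above (my reading of the answer, not code taken from it - it assumes the same index vector i from the question and a caret version that has indexOut):

#fit on the non-validation rows, compute the hold-out estimate on the validation rows
tc <- trainControl(method="cv", number=1,
                   index=list(Fold1=(1:150)[-i]),   #rows used to fit the model
                   indexOut=list(Fold1=i))          #rows used for the hold-out estimate
model.rf <- train(Species ~ ., data=iris, method="rf", trControl=tc)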
Answer 1 (score: 3):
I think I may have found a workaround for this, but I'm not 100% sure it's doing what I want, and I'm still hoping someone comes up with something more elegant. Anyway, I realized it probably makes the most sense to include the validation set inside my training set and simply define the resampling measure to use only the validation data. I think this should do the trick for the example above:
> library(caret)
> set.seed(1)
>
> #training/validation set indices
> i <- sample(150,50) #note - I no longer need to explicitly create train/validation sets
>
> #explicitly define the cross-validation indices to be those from the validation set
> tc <- trainControl(method="cv",number=1,index=list(Fold1=(1:150)[-i]),savePredictions=T)
> (model.rf <- train(Species ~ ., data=iris,method="rf",trControl=tc))
150 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Cross-Validation (1 fold)
Summary of sample sizes: 100
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.94 0.907
3 0.94 0.907
4 0.94 0.907
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
>
> #I think this worked because the resampling indices line up?
> all(sort(unique(model.rf$pred$rowIndex)) == sort(i))
[1] TRUE
> #the exact contingency table from above also indicates that this works
> table(model.rf$pred[model.rf$pred$.mtry==model.rf$bestTune[[1]],c("obs","pred")])
pred
obs setosa versicolor virginica
setosa 17 0 0
versicolor 0 20 2
virginica 0 1 10
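As a follow-up (not part of the original answer), the hold-out accuracy can also be read straight from the stored predictions and compared against the manual validation-set check in the question - a hypothetical convenience, assuming the same objects as above:

#accuracy on the held-out rows for the selected tuning value;
#older caret versions name the tuning column ".mtry", newer ones "mtry"
best.pred <- model.rf$pred[model.rf$pred$.mtry == model.rf$bestTune[[1]], ]
mean(best.pred$pred == best.pred$obs)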