It seems that using the same data but a different column order changes the results.
A minimal, reproducible example:
library(mlbench)
data(Sonar)
library(caret)
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)
set.seed(825)
gbmFit1 <- train(Class ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit1
Result:
Stochastic Gradient Boosting
157 samples
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 142, 142, 140, 142, 142, 141, ...
Resampling results across tuning parameters:
interaction.depth  n.trees  Accuracy       Kappa
1                   50      0.7609191      0.5163703
1                  100      0.7934216      0.5817734
1                  150      0.7977230      0.5897796
2                   50      0.7858235      0.5667749
2                  100      **0.8188897**  **0.6316548**
2                  150      **0.8194363**  **0.6329037**
3                   50      **0.7889436**  **0.5713790**
3                  100      0.8130564      0.6195719
3                  150      0.8221348      0.6383441
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 150, interaction.depth =
3, shrinkage = 0.1 and n.minobsinnode = 10.
Then I tried reordering the columns:
finalVars <- colnames(training)
# reorder the columns alphabetically
finalVars <- sort(finalVars)
set.seed(825)
# stored as gbmFit2 so both fits can be compared afterwards
gbmFit2 <- train(Class ~ ., data = training[, finalVars],
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit2
Result, where the bold figures show that a different column order produces different numbers:
Stochastic Gradient Boosting
157 samples
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 142, 142, 140, 142, 142, 141, ...
Resampling results across tuning parameters:
interaction.depth  n.trees  Accuracy       Kappa
1                   50      0.7609191      0.5163703
1                  100      0.7934216      0.5817734
1                  150      0.7977230      0.5897796
2                   50      0.7858235      0.5669550
2                  100      **0.8194779**  **0.6331626**
2                  150      **0.8207279**  **0.6354601**
3                   50      **0.7946936**  **0.5831441**
3                  100      0.8130564      0.6195719
3                  150      0.8220931      0.6381234
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 150, interaction.depth =
3, shrinkage = 0.1 and n.minobsinnode = 10.
This problem applies to several other models I checked as well: rpart, C5.0 (a sketch of the rpart check appears below). Does anyone know why this happens?
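For reference, a minimal sketch of that rpart check (my own addition, not from the original post); it reuses the training and fitControl objects from above, and rpartFit1/rpartFit2 are hypothetical names:

# same column-order comparison as above, but with CART via caret
set.seed(825)
rpartFit1 <- train(Class ~ ., data = training,
                   method = "rpart", trControl = fitControl)

set.seed(825)
rpartFit2 <- train(Class ~ ., data = training[, sort(names(training))],
                   method = "rpart", trControl = fitControl)

# compare the best resampled accuracy under each column order
max(rpartFit1$results$Accuracy)
max(rpartFit2$results$Accuracy)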
Answer 0 (score: 0):
You are not seeing different results because of caret, but because of the "gbm" algorithm itself. In "gbm", reordering the columns is very much like changing the seed.
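To illustrate the point, here is a minimal sketch (my own, not from the original answer) that reproduces the effect with gbm() directly, bypassing caret; it assumes the training data frame from above, and trainNum/fitA/fitB are hypothetical names:

library(gbm)

# gbm's bernoulli loss expects a 0/1 outcome, so recode the factor
trainNum <- training
trainNum$Class <- ifelse(trainNum$Class == "M", 1, 0)

# same seed, original column order
set.seed(825)
fitA <- gbm(Class ~ ., data = trainNum,
            distribution = "bernoulli", n.trees = 150,
            interaction.depth = 3, shrinkage = 0.1, n.minobsinnode = 10)

# same seed, columns sorted alphabetically
set.seed(825)
fitB <- gbm(Class ~ ., data = trainNum[, sort(names(trainNum))],
            distribution = "bernoulli", n.trees = 150,
            interaction.depth = 3, shrinkage = 0.1, n.minobsinnode = 10)

# if column order acts like a seed change, the per-iteration training
# deviance should diverge between the two fits
head(cbind(original = fitA$train.error, sorted = fitB$train.error))

Whether the deviances diverge at a given iteration depends on ties in the split search; the point is only that column order can act as a tie-breaker, much like a different random seed.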