Hi everyone, I'm trying to search for the best parameters with a for loop. However, the results confuse me. The two pieces of code below should give the same result, because the parameter "mtry" is the same.
gender Partner tenure Churn
3521 Male No 0.992313 Yes
2525.1 Male No 4.276666 No
567 Male Yes 2.708050 No
8381 Female No 4.202127 Yes
6258 Female No 0.000000 Yes
6569 Male Yes 2.079442 No
27410 Female No 1.550804 Yes
6429 Female No 1.791759 Yes
412 Female Yes 3.828641 No
4655 Female Yes 3.737670 No
RFModel = randomForest(Churn ~ .,
data = ggg,
ntree = 30,
mtry = 2,
importance = TRUE,
replace = FALSE)
print(RFModel$confusion)
No Yes class.error
No 4 1 0.2
Yes 1 4 0.2
for(i in c(2)){
RFModel = randomForest(Churn ~ .,
data = Trainingds,
ntree = 30,
mtry = i,
importance = TRUE,
replace = FALSE)
print(RFModel$confusion)
}
No Yes class.error
No 3 2 0.4
Yes 2 3 0.4
Answer (score: 2)
You will get slightly different results each time because randomness is built into the algorithm. To build each tree, the algorithm resamples the data frame and randomly selects mtry columns from the resampled data frame on which to build the tree. If you want models built with the same parameters (e.g., mtry, ntree) to give the same results every time, you need to set a random seed. For example, let's run randomForest 10 times and check the mean of the mean squared error for each run. Note that the mean mse is different each time:
library(randomForest)
replicate(10, mean(randomForest(mpg ~ ., data=mtcars)$mse))
[1] 5.998530 6.307782 5.791657 6.125588 5.868717 5.845616 5.427208 6.112762 5.777624 6.150021
If you run the code above, you will get another 10 values that differ from the values above.
If you want to be able to reproduce your results when running a given model with the same parameters (e.g., mtry and ntree), you can set a random seed. For example:
set.seed(5)
mean(randomForest(mpg ~ ., data=mtcars)$mse)
[1] 6.017737
You will get the same result if you use the same seed value, and different results otherwise. Using a larger value of ntree will reduce, but not eliminate, the variability between model runs.
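Applying the same idea to the parameter search from your question, here is a minimal sketch (assuming your training data frame is called Trainingds, as in your loop; the seed value 42 is arbitrary) that fixes the seed before each fit so that every candidate mtry is evaluated under the same resampling:

library(randomForest)

for (i in 1:3) {
  set.seed(42)                      # same seed for every candidate mtry
  RFModel <- randomForest(Churn ~ .,
                          data = Trainingds,
                          ntree = 30,
                          mtry = i,
                          importance = TRUE,
                          replace = FALSE)
  cat("mtry =", i, "\n")
  print(RFModel$confusion)
}

With the seed reset before each call, any remaining differences between iterations come from the value of mtry rather than from the resampling.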
Update: When I run the code with the data sample you provided, I do not get the same results each time. This happens even with replace=FALSE, which results in the data frame being sampled without replacement, because the columns selected to build each tree can still differ from run to run:
> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 30%
Confusion matrix:
No Yes class.error
No 3 2 0.4
Yes 1 4 0.2
> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 20%
Confusion matrix:
No Yes class.error
No 4 1 0.2
Yes 1 4 0.2
> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 30%
Confusion matrix:
No Yes class.error
No 3 2 0.4
Yes 1 4 0.2
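To make that per-tree sub-sampling visible, here is a small sketch that is not part of the original answer; it uses the keep.inbag argument of randomForest with the built-in iris data, and compares which rows the first tree of each unseeded run was grown on:

library(randomForest)

# Even with replace = FALSE, each tree is grown on a random subset of rows
# (and a random choice of mtry columns), so two runs differ unless the seed
# is fixed. keep.inbag = TRUE records which rows each tree used.
m_a <- randomForest(Species ~ ., data = iris, ntree = 5,
                    replace = FALSE, keep.inbag = TRUE)
m_b <- randomForest(Species ~ ., data = iris, ntree = 5,
                    replace = FALSE, keep.inbag = TRUE)

# In-bag indicators for the first tree of each run; these will generally differ.
head(cbind(run_a = m_a$inbag[, 1], run_b = m_b$inbag[, 1]), 10)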
Here is a similar set of results with the built-in iris data frame:
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 3.33%
Confusion matrix:
setosa versicolor virginica class.error
setosa 50 0 0 0.00
versicolor 0 47 3 0.06
virginica 0 2 48 0.04
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 4.67%
Confusion matrix:
setosa versicolor virginica class.error
setosa 50 0 0 0.00
versicolor 0 47 3 0.06
virginica 0 4 46 0.08
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 6%
Confusion matrix:
setosa versicolor virginica class.error
setosa 50 0 0 0.00
versicolor 0 47 3 0.06
virginica 0 6 44 0.12
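For comparison, here is a small sketch (again with the built-in iris data; the seed value 123 is arbitrary) showing that fixing the seed before each call makes the runs identical:

library(randomForest)

set.seed(123)
m_first  <- randomForest(Species ~ ., data = iris, ntree = 30, mtry = 2,
                         importance = TRUE, replace = FALSE)
set.seed(123)
m_second <- randomForest(Species ~ ., data = iris, ntree = 30, mtry = 2,
                         importance = TRUE, replace = FALSE)

identical(m_first$confusion, m_second$confusion)  # TRUE when the seed is the same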
You can also look at the trees generated by each model run; they will typically differ. For example, say I run the following code three times, storing the results in the objects m1, m2, and m3.
m1 <- randomForest(Churn ~ .,
                   data = ggg,
                   ntree = 30,
                   mtry = 2,
                   importance = TRUE,
                   replace = FALSE)
# repeat twice more, assigning the results to m2 and m3
Now let's look at the first four trees from each model object; I've pasted the output below. The output is a list. You can see that the first tree differs across the three model runs. The second tree is the same for the first two model runs but different for the third, and so on.
# Pull trees 1 through 4 out of each fitted forest so they can be compared side by side
check.trees = lapply(1:4, function(i) {
  lapply(list(m1 = m1, m2 = m2, m3 = m3), function(model) getTree(model, i, labelVar = TRUE))
})

check.trees
[[1]]
[[1]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner    1.000000      1        <NA>
2             4              5    gender    1.000000      1        <NA>
3             0              0      <NA>    0.000000     -1          No
4             0              0      <NA>    0.000000     -1         Yes
5             6              7    tenure    2.634489      1        <NA>
6             0              0      <NA>    0.000000     -1         Yes
7             0              0      <NA>    0.000000     -1          No

[[1]]$m2
  left daughter right daughter split var split point status prediction
1             2              3    gender    1.000000      1        <NA>
2             0              0      <NA>    0.000000     -1         Yes
3             4              5    tenure    1.850182      1        <NA>
4             0              0      <NA>    0.000000     -1         Yes
5             0              0      <NA>    0.000000     -1          No

[[1]]$m3
  left daughter right daughter split var split point status prediction
1             2              3    tenure    2.249904      1        <NA>
2             0              0      <NA>    0.000000     -1         Yes
3             0              0      <NA>    0.000000     -1          No

[[2]]
[[2]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1        <NA>
2             0              0      <NA>           0     -1         Yes
3             0              0      <NA>           0     -1          No

[[2]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1        <NA>
2             0              0      <NA>           0     -1         Yes
3             0              0      <NA>           0     -1          No

[[2]]$m3
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1        <NA>
2             4              5    gender           1      1        <NA>
3             0              0      <NA>           0     -1          No
4             0              0      <NA>           0     -1         Yes
5             0              0      <NA>           0     -1          No

[[3]]
[[3]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1        <NA>
2             4              5    gender           1      1        <NA>
3             0              0      <NA>           0     -1          No
4             0              0      <NA>           0     -1         Yes
5             0              0      <NA>           0     -1         Yes

[[3]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1        <NA>
2             0              0      <NA>           0     -1         Yes
3             0              0      <NA>           0     -1          No

[[3]]$m3
  left daughter right daughter split var split point status prediction
1             2              3    tenure    2.129427      1        <NA>
2             0              0      <NA>    0.000000     -1         Yes
3             0              0      <NA>    0.000000     -1          No

[[4]]
[[4]]$m1
  left daughter right daughter split var split point status prediction
1             2              3    tenure    1.535877      1        <NA>
2             0              0      <NA>    0.000000     -1         Yes
3             4              5    tenure    4.015384      1        <NA>
4             0              0      <NA>    0.000000     -1          No
5             6              7    tenure    4.239396      1        <NA>
6             0              0      <NA>    0.000000     -1         Yes
7             0              0      <NA>    0.000000     -1          No

[[4]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1        <NA>
2             0              0      <NA>           0     -1         Yes
3             0              0      <NA>           0     -1          No

[[4]]$m3
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1        <NA>
2             0              0      <NA>           0     -1         Yes
3             0              0      <NA>           0     -1          No
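If you would rather check agreement programmatically than read through the printed trees, here is a quick sketch that reuses the check.trees list built above and compares the extracted trees with identical():

# For each of the first four trees, check whether the three runs agree
sapply(check.trees, function(tree_set) {
  c(m1_vs_m2 = identical(tree_set$m1, tree_set$m2),
    m1_vs_m3 = identical(tree_set$m1, tree_set$m3))
})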