Hi everyone, I'm trying to search for the best parameters with a for loop. However, the results confuse me. The two pieces of code below should give the same result, because the parameter "mtry" is the same.
gender Partner tenure Churn
3521 Male No 0.992313 Yes
2525.1 Male No 4.276666 No
567 Male Yes 2.708050 No
8381 Female No 4.202127 Yes
6258 Female No 0.000000 Yes
6569 Male Yes 2.079442 No
27410 Female No 1.550804 Yes
6429 Female No 1.791759 Yes
412 Female Yes 3.828641 No
4655 Female Yes 3.737670 No
RFModel = randomForest(Churn ~ .,
data = ggg,
ntree = 30,
mtry = 2,
importance = TRUE,
replace = FALSE)
print(RFModel$confusion)
No Yes class.error
No 4 1 0.2
Yes 1 4 0.2
for(i in c(2)){
RFModel = randomForest(Churn ~ .,
data = Trainingds,
ntree = 30,
mtry = i,
importance = TRUE,
replace = FALSE)
print(RFModel$confusion)
}
No Yes class.error
No 3 2 0.4
Yes 2 3 0.4
Answer (score: 2)
You will get slightly different results each time because randomness is built into the algorithm. To build each tree, the algorithm resamples the data frame and randomly selects mtry columns from the resampled data frame on which to build the tree. If you want models built with the same parameters (e.g., mtry, ntree) to give the same results every time, you need to set a random seed. For example, let's run randomForest 10 times and check the mean of the mean squared error for each run. Note that the mean mse is different each time:
library(randomForest)
replicate(10, mean(randomForest(mpg ~ ., data=mtcars)$mse))
[1] 5.998530 6.307782 5.791657 6.125588 5.868717 5.845616 5.427208 6.112762 5.777624 6.150021
If you run the code above, you will get another 10 values that differ from the values above.
If you want to be able to reproduce your results when running a given model with the same parameters (e.g., mtry and ntree), you can set a random seed. For example:
set.seed(5)
mean(randomForest(mpg ~ ., data=mtcars)$mse)
[1] 6.017737
You will get the same result if you use the same seed value, and different results otherwise. Using a larger value of ntree will reduce, but not eliminate, the variability between model runs.
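Applying the same idea to the parameter search from your question, here is a minimal sketch (assuming your training data frame is called Trainingds, as in your loop; the seed value 42 is arbitrary) that fixes the seed before each fit so that every candidate mtry is evaluated under the same resampling:

library(randomForest)

for (i in 1:3) {
  set.seed(42)                      # same seed for every candidate mtry
  RFModel <- randomForest(Churn ~ .,
                          data = Trainingds,
                          ntree = 30,
                          mtry = i,
                          importance = TRUE,
                          replace = FALSE)
  cat("mtry =", i, "\n")
  print(RFModel$confusion)
}

With the seed reset before each call, any remaining differences between iterations come from the value of mtry rather than from the resampling.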
Update: When I run the code with the data sample you provided, I do not get the same results each time. This happens even with replace=FALSE, which results in the data frame being sampled without replacement, because the columns selected to build each tree can still differ from run to run:
> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 30%
Confusion matrix:
No Yes class.error
No 3 2 0.4
Yes 1 4 0.2
> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 20%
Confusion matrix:
No Yes class.error
No 4 1 0.2
Yes 1 4 0.2
> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 30%
Confusion matrix:
No Yes class.error
No 3 2 0.4
Yes 1 4 0.2
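To make that per-tree sub-sampling visible, here is a small sketch that is not part of the original answer; it uses the keep.inbag argument of randomForest with the built-in iris data, and compares which rows the first tree of each unseeded run was grown on:

library(randomForest)

# Even with replace = FALSE, each tree is grown on a random subset of rows
# (and a random choice of mtry columns), so two runs differ unless the seed
# is fixed. keep.inbag = TRUE records which rows each tree used.
m_a <- randomForest(Species ~ ., data = iris, ntree = 5,
                    replace = FALSE, keep.inbag = TRUE)
m_b <- randomForest(Species ~ ., data = iris, ntree = 5,
                    replace = FALSE, keep.inbag = TRUE)

# In-bag indicators for the first tree of each run; these will generally differ.
head(cbind(run_a = m_a$inbag[, 1], run_b = m_b$inbag[, 1]), 10)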
Here is a similar set of results with the built-in iris data frame:
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 3.33%
Confusion matrix:
setosa versicolor virginica class.error
setosa 50 0 0 0.00
versicolor 0 47 3 0.06
virginica 0 2 48 0.04
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 4.67%
Confusion matrix:
setosa versicolor virginica class.error
setosa 50 0 0 0.00
versicolor 0 47 3 0.06
virginica 0 4 46 0.08
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 6%
Confusion matrix:
setosa versicolor virginica class.error
setosa 50 0 0 0.00
versicolor 0 47 3 0.06
virginica 0 6 44 0.12
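For comparison, here is a small sketch (again with the built-in iris data; the seed value 123 is arbitrary) showing that fixing the seed before each call makes the runs identical:

library(randomForest)

set.seed(123)
m_first  <- randomForest(Species ~ ., data = iris, ntree = 30, mtry = 2,
                         importance = TRUE, replace = FALSE)
set.seed(123)
m_second <- randomForest(Species ~ ., data = iris, ntree = 30, mtry = 2,
                         importance = TRUE, replace = FALSE)

identical(m_first$confusion, m_second$confusion)  # TRUE when the seed is the same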
You can also look at the trees generated by each model run; they will typically differ. For example, say I run the following code three times, storing the results in the objects m1, m2, and m3.
m1 <- randomForest(Churn ~ .,
                   data = ggg,
                   ntree = 30,
                   mtry = 2,
                   importance = TRUE,
                   replace = FALSE)
# repeat twice more, assigning the results to m2 and m3
Now let's look at the first four trees from each model object; I've pasted the output below. The output is a list. You can see that the first tree differs across the three model runs. The second tree is the same for the first two model runs but different for the third, and so on.
# Pull trees 1 through 4 out of each fitted forest so they can be compared side by side
check.trees = lapply(1:4, function(i) {
  lapply(list(m1 = m1, m2 = m2, m3 = m3), function(model) getTree(model, i, labelVar = TRUE))
})

check.trees
[[1]]
[[1]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner    1.000000      1        <NA>
2             4              5    gender    1.000000      1        <NA>
3             0              0      <NA>    0.000000     -1          No
4             0              0      <NA>    0.000000     -1         Yes
5             6              7    tenure    2.634489      1        <NA>
6             0              0      <NA>    0.000000     -1         Yes
7             0              0      <NA>    0.000000     -1          No

[[1]]$m2
  left daughter right daughter split var split point status prediction
1             2              3    gender    1.000000      1        <NA>
2             0              0      <NA>    0.000000     -1         Yes
3             4              5    tenure    1.850182      1        <NA>
4             0              0      <NA>    0.000000     -1         Yes
5             0              0      <NA>    0.000000     -1          No

[[1]]$m3
  left daughter right daughter split var split point status prediction
1             2              3    tenure    2.249904      1        <NA>
2             0              0      <NA>    0.000000     -1         Yes
3             0              0      <NA>    0.000000     -1          No

[[2]]
[[2]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1        <NA>
2             0              0      <NA>           0     -1         Yes
3             0              0      <NA>           0     -1          No

[[2]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1        <NA>
2             0              0      <NA>           0     -1         Yes
3             0              0      <NA>           0     -1          No

[[2]]$m3
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1        <NA>
2             4              5    gender           1      1        <NA>
3             0              0      <NA>           0     -1          No
4             0              0      <NA>           0     -1         Yes
5             0              0      <NA>           0     -1          No

[[3]]
[[3]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1        <NA>
2             4              5    gender           1      1        <NA>
3             0              0      <NA>           0     -1          No
4             0              0      <NA>           0     -1         Yes
5             0              0      <NA>           0     -1         Yes

[[3]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1        <NA>
2             0              0      <NA>           0     -1         Yes
3             0              0      <NA>           0     -1          No

[[3]]$m3
  left daughter right daughter split var split point status prediction
1             2              3    tenure    2.129427      1        <NA>
2             0              0      <NA>    0.000000     -1         Yes
3             0              0      <NA>    0.000000     -1          No

[[4]]
[[4]]$m1
  left daughter right daughter split var split point status prediction
1             2              3    tenure    1.535877      1        <NA>
2             0              0      <NA>    0.000000     -1         Yes
3             4              5    tenure    4.015384      1        <NA>
4             0              0      <NA>    0.000000     -1          No
5             6              7    tenure    4.239396      1        <NA>
6             0              0      <NA>    0.000000     -1         Yes
7             0              0      <NA>    0.000000     -1          No

[[4]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1        <NA>
2             0              0      <NA>           0     -1         Yes
3             0              0      <NA>           0     -1          No

[[4]]$m3
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1        <NA>
2             0              0      <NA>           0     -1         Yes
3             0              0      <NA>           0     -1          No
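If you would rather check agreement programmatically than read through the printed trees, here is a quick sketch that reuses the check.trees list built above and compares the extracted trees with identical():

# For each of the first four trees, check whether the three runs agree
sapply(check.trees, function(tree_set) {
  c(m1_vs_m2 = identical(tree_set$m1, tree_set$m2),
    m1_vs_m3 = identical(tree_set$m1, tree_set$m3))
})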