Question

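I have been using Random Forest for a classification task. I have read several references that discuss whether more trees are always better, and that note we can use the OOB error rate to obtain a running unbiased estimate of the classification error as trees are added to the forest.

However, even with the OOB error rate I still cannot determine the optimal number of trees in the random forest, because I have to set a range of minimum and maximum tree counts to evaluate, and the optimal number might lie outside that range. I would appreciate your advice on how to obtain the optimal number of trees in a Random Forest from the OOB error rate. Below is the code that uses the OOB error rate with a specific range of minimum and maximum tree counts (10 to 100):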

import matplotlib.pyplot as plt

from collections import OrderedDict
from sklearn.ensemble import RandomForestClassifier


ensemble_clfs = [
    (
        "RandomForestClassifier, max_features=None",
        RandomForestClassifier(warm_start=True, max_features=None, oob_score=True),
    )
]


error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)


min_estimators = 10
max_estimators = 100

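# X and Y are the training features and labels (not defined in this snippet).
for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1):
        # With warm_start=True, each fit keeps the existing trees and adds new ones.
        clf.set_params(n_estimators=i)
        clf.fit(X, Y)
        # oob_score_ is accuracy on out-of-bag samples, so the error is its complement.
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))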


for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show()

Result (plot of OOB error rate against n_estimators; image not available):

Answer 1

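There is no specific value you need to aim for; it is whatever error you are comfortable with. Is a 0.1 error rate good enough for you? Or do you need 0.05? It all depends on the data you are working with. In some cases I have seen 0.2 treated as acceptable.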
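One way to turn "an error you are comfortable with" into a number of trees is to pick the smallest forest whose OOB error is at or below your chosen target. The sketch below assumes you already have a list of (n_estimators, OOB error) pairs like the one the question's loop builds. The helper name `smallest_acceptable` and the sample numbers are made up for illustration and are not part of scikit-learn.

```python
def smallest_acceptable(curve, target_error):
    """Return the smallest n_estimators whose OOB error is <= target_error.

    curve is an iterable of (n_estimators, oob_error) pairs.
    Returns None if no forest size in the curve meets the target.
    """
    for n, err in sorted(curve):
        if err <= target_error:
            return n
    return None


# Made-up (n_estimators, OOB error) pairs, for illustration only.
curve = [(10, 0.21), (20, 0.14), (30, 0.11), (40, 0.09), (50, 0.10)]

print(smallest_acceptable(curve, 0.10))  # smallest size at or below 0.10
print(smallest_acceptable(curve, 0.05))  # None: target not reached in this range
```

A result of None suggests either widening the range of tree counts or accepting that the target may not be reachable on this data.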

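That said, a few things about the code you are using:

1. The reason you see a "jagged" line as n_estimators increases is that you are not comparing the error rates properly. You need to define random_state in your RandomForestClassifier so that it draws from the same pool.
2. At some point, performance and speed will matter more than accuracy, and you will need to decide which is more important. Say that at n_estimators = 100 you have an error of 0.2 and it takes about 10 minutes to run (this depends on your data; it is only a rough estimate). At n_estimators = 1000 your error rate is 0.18, but the run takes about 25 minutes. Is the extra 15 minutes worth the 0.02 improvement? It all depends on the type of data you are working with.
3. If you want a more granular picture, change your step to 5, maybe 3, and see where the error levels off. A step size of 1 is probably too small to give an overall view of the error rate. From there, once you have a rough idea of the trade-off between error rate and speed, you can refine the range.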
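The three points above can be combined into one sweep: fix random_state, step through n_estimators in coarser increments, and record the fit time next to the OOB error. This is only a sketch under stated assumptions. The data is a synthetic stand-in from make_classification, the function name `oob_sweep` is hypothetical, and the range and step are arbitrary. Each forest is fitted from scratch (no warm_start) so the timing reflects the full cost of that forest size.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption); substitute your own X and Y.
X, Y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)


def oob_sweep(X, Y, min_estimators, max_estimators, step, seed=0):
    """Return a list of (n_estimators, oob_error, fit_seconds) tuples."""
    results = []
    for n in range(min_estimators, max_estimators + 1, step):
        clf = RandomForestClassifier(
            n_estimators=n,
            max_features=None,
            oob_score=True,
            random_state=seed,  # fixed seed so reruns give the same curve
        )
        start = time.perf_counter()
        clf.fit(X, Y)
        elapsed = time.perf_counter() - start
        results.append((n, 1 - clf.oob_score_, elapsed))
    return results


results = oob_sweep(X, Y, min_estimators=25, max_estimators=100, step=25)
for n, err, secs in results:
    print(f"n_estimators={n:4d}  OOB error={err:.3f}  fit time={secs:.2f}s")
```

Once the coarse sweep shows roughly where the error flattens, the same function can be rerun with a narrower range and a smaller step around that region.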
