Question

I was looking for applications of random forests, and I found the following knowledge competition on Kaggle:

https://www.kaggle.com/c/forest-cover-type-prediction

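Following the advice at

https://www.kaggle.com/c/forest-cover-type-prediction/forums/t/8182/first-try-with-random-forests-scikit-learn

I used sklearn to build a random forest with 500 trees.

The .oob_score_ was ~2%, but the score on the holdout set was ~75%.

There are only seven classes to classify, so 2% is really low. I also consistently got scores near 75% when I cross-validated.

Can anyone explain the discrepancy between the .oob_score_ and the holdout/cross-validated scores? I would expect them to be similar.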

There's a similar question here:

https://stats.stackexchange.com/questions/95818/what-is-a-good-oob-score-for-random-forests

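Edit: I think it might be a bug, too.

The code is given by the original poster in the second link I posted. The only change is that you have to set oob_score = True when you build the random forest.

I didn't save the cross-validation testing I did, but I could redo it if people need to see it.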

Answer 1

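Q: Can anyone explain the discrepancy ...

A: The sklearn.ensemble.RandomForestClassifier object and its observed .oob_score_ attribute value are not a bug-related issue.

First, RandomForest-based predictors { Classifier | Regressor } belong to a rather specific corner of the so-called ensemble methods, so be aware that typical approaches, including cross-validation, do not work the same way as they do for other AI/ML learners.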

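The RandomForest "inner" logic relies heavily on a RANDOM-PROCESS. The samples (DataSET X) with known y == { labels (for a Classifier) | targets (for a Regressor) } get split throughout the forest generation: each tree is bootstrapped by RANDOMLY splitting the DataSET into a part the tree can see and a part the tree will not see (thus forming an inner oob-subSET).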
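The bootstrap split described above can be sketched without any library. This is a minimal illustration, not scikit-learn's internal code: the helper name `bootstrap_split` and the sample count are assumptions made for the example. It draws one bootstrap sample with replacement and reports how many rows that single tree would never see.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample; return (in_bag, out_of_bag) index lists."""
    # Sampling WITH replacement: some rows repeat, others are never drawn.
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    seen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in seen]
    return in_bag, out_of_bag


rng = random.Random(0)          # fixed seed so the run is repeatable
n_samples = 10_000              # hypothetical DataSET size
in_bag, out_of_bag = bootstrap_split(n_samples, rng)
oob_fraction = len(out_of_bag) / n_samples

# For large n the expected fraction is (1 - 1/n)**n, which approaches 1/e.
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

Each tree gets its own out-of-bag subset, and the out-of-bag score for a row is computed only from the trees that did not see that row.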

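Besides other effects on the sensitivity to overfitting, a RandomForest ensemble does not need to be cross-validated, because by design it does not over-fit as more trees are added. Many papers, as well as Breiman's (Berkeley) empirical evidence, support this statement, as they show that a cross-validated predictor will have roughly the same score as .oob_score_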
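One way to check that claim on your own data is to compute both numbers side by side. The following is a minimal sketch, assuming synthetic seven-class data from `make_classification` as a stand-in (it is not the Kaggle forest-cover data, and the sizes are arbitrary).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 7 classes, as in the question, but otherwise made up.
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=8,
    n_classes=7,
    n_clusters_per_class=1,
    random_state=0,
)

# Out-of-bag estimate, computed during training on the whole DataSET.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

# 5-fold cross-validated accuracy of an identically configured forest.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
)

print(f"oob_score_      : {forest.oob_score_:.3f}")
print(f"mean CV accuracy: {cv_scores.mean():.3f}")
```

If the two numbers differ as widely as in the question, comparing them on a small controlled example like this is a reasonable first step before drawing conclusions.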

One should also be aware that the default values are not the best choice and do not serve well under all circumstances. Consider the problem domain so as to propose a reasonable ensemble parameterization before moving further.

    import sklearn.ensemble

    # Defaults as listed for the scikit-learn release of the time. Newer releases
    # renamed 'mse' to 'squared_error' and no longer accept max_features = 'auto'.
    aRF_PREDICTOR = sklearn.ensemble.RandomForestRegressor(
        n_estimators             = 10,      # The number of trees in the forest.
        criterion                = 'mse',   # { Regressor: 'mse' | Classifier: 'gini' }
        max_depth                = None,
        min_samples_split        = 2,
        min_samples_leaf         = 1,
        min_weight_fraction_leaf = 0.0,
        max_features             = 'auto',
        max_leaf_nodes           = None,
        bootstrap                = True,
        oob_score                = False,   # SET True to get the inner-CrossValidation-alike .oob_score_ attribute calculated during the Training-phase on the whole DataSET
        n_jobs                   = 1,       # { 1 | n-cores | -1 == all-cores }
        random_state             = None,
        verbose                  = 0,
        warm_start               = False
    )

    # Attributes (available after .fit()):
    #   aRF_PREDICTOR.estimators_            aList of <DecisionTreeRegressor>, the collection of fitted sub-estimators.
    #   aRF_PREDICTOR.feature_importances_   array of shape = [n_features], the higher, the more important the feature.
    #   aRF_PREDICTOR.oob_score_             float, score of the training dataset obtained using an out-of-bag estimate.
    #   aRF_PREDICTOR.oob_prediction_        array of shape = [n_samples], prediction computed with out-of-bag estimate on the training set.
    #
    # Methods (transform / fit_transform were dropped from forests in later releases):
    #   aRF_PREDICTOR.apply( X )                       Apply trees in the forest to X, return leaf indices.
    #   aRF_PREDICTOR.fit( X, y[, sample_weight] )     Build a forest of trees from the training set (X, y).
    #   aRF_PREDICTOR.fit_transform( X[, y] )          Fit to data, then transform it.
    #   aRF_PREDICTOR.get_params( [deep] )             Get parameters for this estimator.
    #   aRF_PREDICTOR.predict( X )                     Predict regression target for X.
    #   aRF_PREDICTOR.score( X, y[, sample_weight] )   Returns the coefficient of determination R^2 of the prediction.
    #   aRF_PREDICTOR.set_params( **params )           Set the parameters of this estimator.
    #   aRF_PREDICTOR.transform( X[, threshold] )      Reduce X to its most important features.

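Q: What is a good .oob_score_?

A: .oob_score_ is RANDOM! ... Yes, it MUST be (random).

While this may sound like a provocative epilogue, do not throw your hopes away. A RandomForest ensemble is a great tool. Some problems may come with categorical values in the features (DataSET X), but the cost of processing the ensemble is still adequate once you need to struggle with neither bias nor overfitting. That's great, isn't it?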

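Because results need to be reproducible on subsequent re-runs, it is recommended practice to (re-)set numpy.random and .set_params( random_state = ... ) to a known state before the RANDOM-PROCESS (embedded in every bootstrapping of the RandomForest ensemble). Doing so, one can observe a "de-noised" progression of the RandomForest-based predictor towards a better .oob_score_ that is due to truly improved predictive power, introduced by more ensemble members (n_estimators) and less constrained tree construction (max_depth, max_leaf_nodes etc.), rather than just stochastically, by "better luck" in how the RANDOM-PROCESS happened to split the DataSET ...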
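A minimal sketch of pinning the random state, assuming synthetic data and a hypothetical helper `fit_oob` introduced only for this example. Refitting with the same seed should reproduce the same .oob_score_, while a different seed may move it slightly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for the example, not the Kaggle set).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)


def fit_oob(seed):
    """Fit a fresh forest with a known random state and return its OOB score."""
    forest = RandomForestClassifier(n_estimators=50, oob_score=True)
    forest.set_params(random_state=seed)   # put the RANDOM-PROCESS in a known state
    forest.fit(X, y)
    return forest.oob_score_


same_a = fit_oob(42)
same_b = fit_oob(42)
other = fit_oob(7)

print("same seed, same score:", same_a == same_b)
print("scores:", same_a, same_b, other)
```

With the seed fixed, any change in .oob_score_ between runs comes from the parameters you changed, not from a different random split.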

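Getting closer to better solutions typically involves putting more trees into the ensemble (RandomForest decisions are based on a majority vote, so 10 estimators is not a big basis for good decisions on highly complex DataSETs). Numbers above 2000 are not uncommon. One may iterate over a range of sizings (with the RANDOM-PROCESS kept under full state control) to demonstrate the ensemble "improvements".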
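Iterating over ensemble sizes with the random state held fixed might look like the sketch below. The data is synthetic and the list of sizes is arbitrary, both chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; replace with your own X, y.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3, random_state=1
)

oob_by_size = {}
for n_trees in (25, 50, 100, 200, 400):
    forest = RandomForestClassifier(
        n_estimators=n_trees,
        oob_score=True,
        random_state=0,      # same known state for every ensemble size
    )
    forest.fit(X, y)
    oob_by_size[n_trees] = forest.oob_score_

for n_trees, score in oob_by_size.items():
    print(f"{n_trees:4d} trees -> oob_score_ = {score:.3f}")
```

Very small ensembles can leave some rows without any out-of-bag prediction, which is one more reason to start the sweep above a handful of trees.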

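If the initial values of .oob_score_ fall somewhere around 0.51 - 0.53, your ensemble is only 1% - 3% better than a RANDOM-GUESS (on a balanced two-class problem).

Only after you have turned your ensemble-based predictor into something better does it make sense to move on to additional tricks such as feature engineering.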
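What counts as a random guess depends on the number of classes and their balance. The sketch below uses a hypothetical helper `chance_levels` and made-up, perfectly balanced seven-class labels to compute two simple baselines to compare an .oob_score_ against.

```python
from collections import Counter


def chance_levels(labels):
    """Return (uniform_guess, majority_guess) accuracy baselines for the labels."""
    counts = Counter(labels)
    uniform_guess = 1.0 / len(counts)                  # pick a class at random
    majority_guess = max(counts.values()) / len(labels)  # always pick the largest class
    return uniform_guess, majority_guess


# Hypothetical labels: 7 classes with 100 samples each.
labels = [c for c in range(1, 8) for _ in range(100)]
uniform_guess, majority_guess = chance_levels(labels)

print(f"uniform guess : {uniform_guess:.3f}")
print(f"majority class: {majority_guess:.3f}")
```

For seven balanced classes both baselines sit near 14%, so a score of about 2% is below chance level rather than slightly above it.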
