Question

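I would like to learn more about a random forest regressor that I built with sklearn. For example, how deep are the trees on average if I do not apply any regularization?

The reason is that I need to regularize the model and want to get a feeling for what the current model looks like. Also, if I set, for example, max_depth, is it still necessary to restrict max_leaf_nodes as well, or does this "problem" resolve itself because the trees cannot grow very deep anyway? Does this make sense, or am I thinking in the wrong direction? I could not find anything on this.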

Answer 1

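If you want to know the average maximum depth of the trees that make up your random forest model, you have to access each tree individually, query its maximum depth, and then compute a statistic from the results.

Let us first build a reproducible example of a random forest classifier model (taken from the Scikit-learn documentation):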

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

clf = RandomForestClassifier(n_estimators=100,
                             random_state=0)
clf.fit(X, y)

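Now we can iterate over the estimators_ attribute, which contains each fitted decision tree. For each decision tree, we query the attribute tree_.max_depth, store the value, and take the average once the iteration is complete: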

max_depth = list()
for tree in clf.estimators_:
    max_depth.append(tree.tree_.max_depth)

print("avg max depth %0.1f" % (sum(max_depth) / len(max_depth)))

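This gives you an idea of the average maximum depth of the trees that make up your random forest model (it works exactly the same way for a regressor model, which is what you asked about).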
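Since the question is about a regressor, here is a minimal sketch of the same idea with RandomForestRegressor. The synthetic data, the sample sizes, and the names X_reg, y_reg, and reg are assumptions made for illustration only. It uses the get_depth() and get_n_leaves() methods of the fitted trees, which should be available in reasonably recent scikit-learn releases; get_depth() reports the same value as tree_.max_depth.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data; sizes and seeds are arbitrary illustrative choices.
X_reg, y_reg = make_regression(n_samples=500, n_features=4,
                               noise=0.1, random_state=0)

# No max_depth or max_leaf_nodes is set, so each tree grows
# until its nodes cannot be split any further.
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_reg, y_reg)

# Collect the depth and the number of leaves of every fitted tree.
depths = [est.get_depth() for est in reg.estimators_]
leaves = [est.get_n_leaves() for est in reg.estimators_]

print("avg max depth: %.1f" % (sum(depths) / len(depths)))
print("avg number of leaves: %.1f" % (sum(leaves) / len(leaves)))
```

Looking at both numbers for the unregularized forest gives a starting point for choosing candidate values of max_depth and max_leaf_nodes.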

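In any case, as a suggestion: if you want to regularize your model, you are better off testing your parameter hypotheses within a cross-validation and grid/random search framework. In that setting you do not really need to ask yourself how the hyperparameters interact with each other. You simply test different combinations and pick the best one based on the cross-validation score.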
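A minimal sketch of such a search, assuming GridSearchCV and a small hand-picked grid over max_depth and max_leaf_nodes. The grid values, the number of folds, and the reduced forest size are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the search runs quickly.
X, y = make_classification(n_samples=300, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# Candidate values are arbitrary; None means "no limit" for that parameter.
param_grid = {
    "max_depth": [3, 5, None],
    "max_leaf_nodes": [10, 50, None],
}

# Every combination in the grid is scored with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print("best CV score: %.3f" % search.best_score_)
```

Because both parameters are searched jointly, the cross-validation score decides whether limiting one of them makes the other redundant for your data. For larger grids, RandomizedSearchCV is the usual alternative.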

Answer 2

In addition to @Luca Massaron's answer:

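I found that https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py can be applied to each tree in the forest by looping over the estimators:

for tree in clf.estimators_:
    ...  # apply the analysis from the linked example to tree.tree_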

The number of leaf nodes can be computed like this:

import numpy as np

n_trees = len(clf.estimators_)
n_leaves = np.zeros(n_trees, dtype=int)
for i in range(n_trees):
    tree = clf.estimators_[i].tree_
    n_nodes = tree.node_count
    # A leaf has no children: children_left and children_right are
    # both -1 there, so checking either array is enough.
    children_left = tree.children_left
    for x in range(n_nodes):
        if children_left[x] == -1:
            n_leaves[i] += 1
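As a cross-check, the same count can be obtained without an explicit inner loop. The sketch below assumes a small forest fitted on synthetic data (the dataset and the names n_leaves_manual and n_leaves_builtin are illustrative) and compares a vectorized version of the children_left test with the get_n_leaves() method, which should be available in reasonably recent scikit-learn releases:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Vectorized form of the manual count: leaves are the nodes whose
# left-child index is -1.
n_leaves_manual = np.array(
    [(est.tree_.children_left == -1).sum() for est in clf.estimators_]
)

# Built-in accessor on each fitted decision tree.
n_leaves_builtin = np.array([est.get_n_leaves() for est in clf.estimators_])

print(np.array_equal(n_leaves_manual, n_leaves_builtin))
print("avg number of leaves: %.1f" % n_leaves_builtin.mean())
```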

How do I get information about the trees in a random forest in sklearn?
