How is a random forest's feature importance "actually computed" in sklearn?

Date: 2020-04-07 09:16:49

Tags: python machine-learning scikit-learn random-forest decision-tree

I followed this question to compute feature importances for a decision tree: scikit learn - feature importance calculation in decision trees

However, when I compute the feature importances for a random forest myself, I cannot reproduce the correct values. For example, I fit a forest with the following code (using the sklearn package in Python):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=2, max_features='log2')
clf.fit(X_train, y_train)
# importances of the whole forest, and of each individual tree
feature_imp = pd.Series(clf.feature_importances_, index=features_id).sort_values(ascending=False)
feature_imp_each_tree = [tree.feature_importances_.T for tree in clf.estimators_]

[two images: plots of the two fitted trees]

From this I get the forest's feature importances: #1: 0.1875, #2: 0.3313, #3: 0.4813

The feature importances of each individual tree are:

Left tree: #1: 0.375, #2: 0.5625, #3: 0.0625. Right tree: #1: 0, #2: 0.1, #3: 0.9
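As a sanity check on the numbers above: the forest values are exactly the mean of the two per-tree vectors. A short NumPy sketch (values copied from the output above):

```python
import numpy as np

# Per-tree importances as reported by sklearn above
left_tree = np.array([0.375, 0.5625, 0.0625])
right_tree = np.array([0.0, 0.1, 0.9])

# Averaging the per-tree vectors reproduces the forest importances
forest = np.mean([left_tree, right_tree], axis=0)
print(forest)  # [0.1875  0.33125 0.48125]
```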

So I followed the steps from that question and computed by hand...

feature #1 on left tree: (2/4)*(0.5-0-0) = 0.25
feature #2 on left tree: (2/4)*(0.38-0-0) = 0.19
feature #3 on left tree: (4/4)*(0.44-0.38*2/4-0.5*2/4) = 0

feature #1 on right tree: 0
feature #2 on right tree: (5/5)*(0.28-0.38*3/5-0) = 0.052
feature #3 on right tree: (3/5)*(0.38-0-0) = 0.228
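The per-node arithmetic above can be automated by reading a fitted tree's internals. Below is a sketch of the weighted-impurity-decrease computation using the documented `tree_` attributes; to my understanding sklearn accumulates the same quantity per feature, divides by the root's weighted sample count, and normalizes to sum to 1 (the iris dataset here is just a stand-in for the asker's data):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def manual_importances(tree):
    """Weighted impurity decrease per feature, normalized to sum to 1."""
    t = tree.tree_
    imp = np.zeros(t.n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, no contribution
            continue
        # impurity decrease of this split, weighted by sample counts
        decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                    - t.weighted_n_node_samples[left] * t.impurity[left]
                    - t.weighted_n_node_samples[right] * t.impurity[right])
        imp[t.feature[node]] += decrease
    imp /= t.weighted_n_node_samples[0]  # divide by samples at the root
    return imp / imp.sum()               # normalize so importances sum to 1

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(manual_importances(tree))
print(tree.feature_importances_)  # matches
```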

I know the importances for a random forest need to be normalized so they sum to 1, so after normalization (importance / sum):

sum = 0.25+0.19+0 = 0.44
feature #1 on left tree: 0.25/0.44 = 0.5682
feature #2 on left tree: 0.19/0.44 = 0.4318
feature #3 on left tree: 0/0.44 = 0

sum = 0+0.052+0.228 = 0.28
feature #1 on right tree: 0/0.28 = 0
feature #2 on right tree: 0.052/0.28 = 0.186
feature #3 on right tree: 0.228/0.28 = 0.814

Then I average over the two trees:

feature #1: (0.5682+0)/2 = 0.2841
feature #2: (0.4318+0.186)/2 = 0.3088
feature #3: (0+0.814)/2 = 0.4071
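The hand calculation can be scripted; this sketch just reproduces the arithmetic of the steps above (normalize each tree's raw impurity decreases, then average across trees):

```python
import numpy as np

# Raw per-feature impurity decreases computed by hand above
left_raw = np.array([0.25, 0.19, 0.0])
right_raw = np.array([0.0, 0.052, 0.228])

# Normalize each tree so its importances sum to 1, then average the trees
left_norm = left_raw / left_raw.sum()
right_norm = right_raw / right_raw.sum()
averaged = (left_norm + right_norm) / 2
print(averaged.round(4))  # [0.2841 0.3088 0.4071]
```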

The ranking comes out right, but the values do not match sklearn's. Please help me understand how the calculation is actually done. Thanks!

0 Answers