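I followed this question to calculate feature importance for a decision tree: scikit learn - feature importance calculation in decision trees
However, when I calculate feature importance for a random forest, I cannot reproduce the correct values. For example, I fit a random forest with the code below, using the sklearn package in Python.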
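import pandas as pd
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=2, max_features='log2')
clf.fit(X_train, y_train)
# Forest-level importances, indexed by feature id and sorted in descending order
feature_imp = pd.Series(clf.feature_importances_, index=features_id).sort_values(ascending=False)
# Importances of each individual tree in the forest
feature_imp_each_tree = [tree.feature_importances_.T for tree in clf.estimators_]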
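This gives the feature importances #1: 0.1875, #2: 0.3313, #3: 0.4813
The feature importances of each tree are
left tree #1: 0.375, #2: 0.5625, #3: 0.0625
right tree #1: 0, #2: 0.1, #3: 0.9
So I followed the steps to calculate them by hand...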
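As a quick sanity check on the reported numbers, the sketch below averages the two per-tree vectors (reading the second value for the left tree as 0.5625). The variable names are illustrative only. As far as I understand, the forest-level `feature_importances_` is the mean of the per-tree importances, which are already normalized, and the numbers above are consistent with that.

```python
# Per-tree importances as reported above, in the order feature #1, #2, #3.
left_tree = [0.375, 0.5625, 0.0625]
right_tree = [0.0, 0.1, 0.9]

# Element-wise mean over the two trees.
forest = [(l + r) / 2 for l, r in zip(left_tree, right_tree)]
print(forest)
```

Up to rounding, the result agrees with the forest-level values 0.1875, 0.3313 and 0.4813, so the target of the hand calculation is each tree's own normalized importance vector.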
feature #1 on left tree: (2/4)*(0.5-0-0)=0.25
feature #2 on left tree: (2/4)*(0.38-0-0)=0.19
feature #3 on left tree: (4/4)*(0.44-0.38*2/4-0.5*2/4)=0
feature #1 on right tree: 0
feature #2 on right tree: (5/5)*(0.28-0.38*3/5-0)=0.052
feature #3 on right tree: (3/5)*(0.38-0-0)=0.228
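The per-node formula used in these steps can be written as a small helper. This is only a sketch of the hand calculation: `node_importance` is a hypothetical name, the impurities are the two-decimal values quoted above, and pure children (impurity 0) are simply left out of the `children` list because they contribute nothing.

```python
def node_importance(n_node, n_total, impurity, children):
    """Weighted impurity decrease of one split node.

    children: list of (n_samples, impurity) pairs for the impure child nodes.
    """
    weighted_children = sum(n / n_node * imp for n, imp in children)
    return n_node / n_total * (impurity - weighted_children)

# Left tree (4 samples at the root).
left_raw = [
    node_importance(2, 4, 0.5, []),                      # feature #1
    node_importance(2, 4, 0.38, []),                     # feature #2
    node_importance(4, 4, 0.44, [(2, 0.38), (2, 0.5)]),  # feature #3 (root)
]

# Right tree (5 samples at the root); feature #1 is never used for a split.
right_raw = [
    0.0,                                       # feature #1
    node_importance(5, 5, 0.28, [(3, 0.38)]),  # feature #2 (root)
    node_importance(3, 5, 0.38, []),           # feature #3
]
print(left_raw, right_raw)
```

Note that (2/4)*0.38 evaluates to 0.19, which is the value used for feature #2 in the normalization step.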
I know that the importances in a random forest need to be normalized so that they sum to 1, so after normalization (importance / sum):
sum = 0.25+0.19+0 = 0.44
feature #1 on left tree: 0.25/0.44 = 0.5682
feature #2 on left tree: 0.19/0.44 = 0.4318
feature #3 on left tree: 0/0.44 = 0
sum = 0+0.052+0.228 = 0.28
feature #1 on right tree: 0/0.28 = 0
feature #2 on right tree: 0.052/0.28 = 0.186
feature #3 on right tree: 0.228/0.28 = 0.814
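The normalization step above, sketched in plain Python. `normalize` is a hypothetical helper, and the raw values are the hand-computed ones from the previous step (with 0.19 for feature #2 on the left tree).

```python
def normalize(importances):
    """Scale non-negative importances so that they sum to 1."""
    total = sum(importances)
    # Guard against a tree with no splits, where every raw importance is 0.
    if total == 0:
        return list(importances)
    return [v / total for v in importances]

left_norm = normalize([0.25, 0.19, 0.0])
right_norm = normalize([0.0, 0.052, 0.228])
print([round(v, 4) for v in left_norm])
print([round(v, 4) for v in right_norm])
```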
Then compute the mean:
feature #1: (0.5682+0)/2 = 0.2841
feature #2: (0.4318+0.186)/2 = 0.3088
feature #3: (0+0.814)/2 = 0.4071
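To cross-check a hand calculation like this against unrounded numbers, the same steps can be run directly on the node arrays of each fitted tree. The sketch below is an assumption-laden illustration, not my actual setup: it uses the iris dataset and a fixed `random_state` in place of my data, and `tree_importances` is a hypothetical helper. It reads `weighted_n_node_samples` and `impurity` from `tree_`, so it avoids the two-decimal rounding and uses the sample weights the tree was actually fit with.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def tree_importances(fitted_tree):
    """Recompute normalized importances from a fitted tree's node arrays."""
    t = fitted_tree.tree_
    raw = np.zeros(t.n_features)
    total = t.weighted_n_node_samples[0]  # weighted sample count at the root
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, no contribution
            continue
        decrease = (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
        raw[t.feature[node]] += decrease / total
    s = raw.sum()
    return raw / s if s > 0 else raw

# Stand-in data (assumption): iris instead of the original X_train, y_train.
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=2, max_features="log2", random_state=0)
clf.fit(X, y)

per_tree = np.array([tree_importances(est) for est in clf.estimators_])
manual = per_tree.mean(axis=0)  # average the normalized per-tree importances
print(manual)
print(clf.feature_importances_)
```

If the two printed vectors agree, any remaining gap in the hand calculation comes from the inputs (rounded impurities or sample counts) rather than from the formula.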
The ranking comes out right, but the values do not match. Please help me understand how they are calculated. Thank you.