I want to get the important features of the decision trees inside a BaggingClassifier(base_estimator=DecisionTreeClassifier(...)). If I compute with all features of my dataset (n = 8900), i.e. max_features=1.0 (float), I can index them correctly. But if I change max_features to any other value (e.g. 181), the indices it returns are relative to the number of features drawn per estimator, so I cannot tell which features of my original dataset they actually correspond to.
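To illustrate what I mean on a toy dataset (just a sketch, not my real data):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 4 original features
bag = BaggingClassifier(DecisionTreeClassifier(), max_features=2).fit(X, y)
# each tree only saw 2 of the 4 columns, so its importance vector has
# length 2 and is indexed 0..1, not with the original column numbers
print(bag.estimators_[0].feature_importances_.shape)  # (2,)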
Here is my code:
import numpy as np
import pydotplus
from IPython.display import Image  # assuming Jupyter's IPython.display.Image
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# X_train, Y_train and target_names are defined earlier in my script
dt = BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                                                             max_features=None, max_leaf_nodes=20, min_impurity_split=0.2,
                                                             min_samples_leaf=6, min_samples_split=2,
                                                             min_weight_fraction_leaf=0.0, presort=False, random_state=7,
                                                             splitter='best'),
                       bootstrap=False, bootstrap_features=True, max_features=181,
                       max_samples=1.0, n_estimators=3, n_jobs=2, oob_score=False,
                       random_state=7, verbose=0, warm_start=False)  # min_samples_leaf=10
# Fit the model
fit_dt = dt.fit(X_train, Y_train)
print(dir(fit_dt))
trees = dt.estimators_
print(trees)
#--------------------
# Print the important features (way 1)
feature_importances = np.mean([
    tree.feature_importances_ for tree in dt.estimators_], axis=0)
print(feature_importances)
indices = np.argsort(feature_importances)[::-1]
print("Feature ranking:")
# NOTE: indices only has max_features (181) entries, but this loop runs over
# all original columns -> this is where the IndexError below is raised
for f in range(X_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], feature_importances[indices[f]]))
# Plotting the trees
from sklearn.externals.six import StringIO  # on newer sklearn: from io import StringIO
from sklearn.tree import export_graphviz

for i in range(0, len(dt.estimators_)):  # dt.estimators_ is a list with all trees
    t = dt.estimators_[i]
    export_graphviz(t, out_file='tree.dot')  # writes the graph to a .dot file
    dot_data = StringIO()  # in-memory text buffer that receives the dot source
    export_graphviz(t, out_file=dot_data, filled=True, class_names=target_names, rounded=True, special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    img = Image(graph.create_png())
    print(dir(img))  # check what attributes the Image object exposes
    with open("my_tree_" + str(i) + ".png", "wb") as png:
        png.write(img.data)
This returns feature numbers, but they are not the feature numbers of my original data, and I get an error:
Error Message: IndexError: index 181 is out of bounds for axis 0 with size 181
Using all features, the results look fine:
Feature ranking: (my code)
1. feature 976 (0.077076)
2. feature 2119 (0.071093)
3. feature 7481 (0.065344)
4. feature 9092 (0.042598)
5. feature 7986 (0.040946)
6. feature 9642 (0.039385)
7. feature 3032 (0.039291)
8. feature 4299 (0.038662)
9. feature 8334 (0.037809)
10. feature 363 (0.037768)
Feature ranking: (@akaran's code, returns different results)
8157 0.213513
5406 0.081889
1461 0.078714
7085 0.059718
3213 0.048554
1901 0.039385
1486 0.038662
1470 0.037289
8328 0.036474
8349 0.027375
Any help would be greatly appreciated.
Answer 0 (score: 0)
Hmm, this is a bit tricky: when max_features is set, every tree draws the same number of features (e.g. 181), but each tree may draw a different subset. The good news is that the ensemble records which original columns each tree saw in its estimators_features_ attribute, as sketched below.
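A minimal sketch of that attribute (assuming a fitted BaggingClassifier named clf):

# estimators_features_[i] holds the ORIGINAL column indices the i-th tree
# was trained on, so feats[j] maps the tree's local column j back to the
# original dataset; both arrays below are aligned position by position
for feats, tree in zip(clf.estimators_features_, clf.estimators_):
    print(feats, tree.feature_importances_)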
Let's try to work this out with a simpler example:
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import pandas as pd

X, y = load_iris(return_X_y=True)
print(X.shape)
# the number of features in this dataset is 4, using 2 for the example's purposes
clf = BaggingClassifier(DecisionTreeClassifier(), max_features=2)
clf.fit(X, y)
print(clf.estimators_features_)

# get the feature importances for each tree
feat_imp = [
    tree.feature_importances_ for tree in clf.estimators_
]

# create an empty dataframe with the unique set of features
# selected by the trees as columns
df = pd.DataFrame([], columns=np.unique(clf.estimators_features_))

# iterate over the feature importances of each tree
for i in range(len(feat_imp)):
    # fill in the importance for each feature if the tree used it, else fill with 0
    for c in df.columns:
        df.loc[i, c] = feat_imp[i][np.where(clf.estimators_features_[i] == c)[0]][0] \
            if c in clf.estimators_features_[i] \
            else 0

# just to check the output
print(df.head())

# get the mean for each feature, sorted from the most important to the least
df = df.astype(float)  # ensure numeric dtype before taking the mean
print(df.mean().sort_values(ascending=False))
The output of this will look something like:
(150, 4) # 150 rows, 4 features
# 10 estimators by default - 10 estimators_features_ different for each estimator
[array([3, 1]), array([0, 2]), array([2, 3]), array([0, 1]), array([0, 3]), array([2, 1]), array([0, 3]), array([2, 1]), array([2, 1]), array([0, 1])]
# the top rows of the dataframe that has the features used as columns
# and the importances as values wherever applicable, else 0
0 1 2 3
0 0 0.0323746 0 0.967625
1 0.124173 0 0.875827 0
2 0 0 0.985627 0.0143726
3 0.603425 0.396575 0 0
4 0.101683 0 0 0.898317
# the final feature importances
# feature importance
2 0.467784
3 0.281387
0 0.155752
1 0.095078
I'm not sure this is the most efficient way, given how many features you have. I'll rethink it and edit if I come up with something better.
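In the meantime, here is a quicker sketch that skips the pandas loop entirely (same fitted clf and X as the iris example above; features a tree never saw simply stay at 0):

n_features = X.shape[1]
totals = np.zeros(n_features)
for tree, feats in zip(clf.estimators_, clf.estimators_features_):
    # scatter this tree's importances back into original-column positions;
    # np.add.at handles repeated indices (bootstrap_features=True can draw
    # the same column twice), unlike a plain fancy-indexed +=
    np.add.at(totals, feats, tree.feature_importances_)
mean_imp = totals / len(clf.estimators_)
print(np.argsort(mean_imp)[::-1])  # original column indices, most important first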
Hope this helps, good luck!