How do I find the key trees/features from a trained random forest?

Date: 2013-06-12 03:14:56

Tags: scikit-learn

I am using the scikit-learn RandomForestClassifier and trying to extract the meaningful trees/features in order to better understand the prediction results.

I found this method in the documentation (http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.get_params), which seems relevant, but I couldn't find an example of how to use it.

I am also hoping to visualize those trees if possible; any relevant code would be great.

Thanks!

3 Answers:

Answer 0: (score: 16)

I think you're looking for Forest.feature_importances_. This lets you see the relative importance of each input feature to the final model. Here's a simple example.

import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier 


# Let's set up a training dataset.  We'll make 100 entries, each with 19 features,
# and each row classified as either 0 or 1.  We'll artificially set the first 3
# features of the rows classified as "1" to fixed values, so that we know these
# are the "important" features.  If we do it right, the model should single out
# these three as important.  The rest of the features will just be noise.
train_data = []  # must be all floats
for _ in range(100):
    line = []
    if random.random() > 0.5:
        line.append(1.0)
        # Add the 3 features that we know indicate a row classified as "1".
        line.append(.77)
        line.append(.33)
        line.append(.55)
        for _ in range(16):  # fill in the rest with noise
            line.append(random.random())
    else:
        # This is a "0" row, so fill it with noise.
        line.append(0.0)
        for _ in range(19):
            line.append(random.random())
    train_data.append(line)
train_data = np.array(train_data)


# Create the random forest object which will include all the parameters
# for the fit.  Feature importances are computed automatically during fit
# (older scikit-learn versions required passing compute_importances=True).
Forest = RandomForestClassifier(n_estimators=100)

# Fit the model.  The first column in our data is the classification target,
# and the rest of the columns are the features.
Forest = Forest.fit(train_data[:, 1:], train_data[:, 0])

# Now you can see the importance of each feature in Forest.feature_importances_;
# these values all add up to one.  Let's call the "important" ones those that
# are above average.
important_features = []
avg_importance = np.average(Forest.feature_importances_)
for index, importance in enumerate(Forest.feature_importances_):
    if importance > avg_importance:
        important_features.append(str(index))
print('Most important features:', ', '.join(important_features))
# The model correctly detects that the first three features are the most
# important, just as we expected!

Answer 1: (score: 6)

To get the relative feature importances, read the relevant section of the documentation, along with the code of the linked examples in that same section.

The trees themselves are stored in the estimators_ attribute of the random forest instance (only after the fit method has been called). Extracting a "key tree" first requires you to define what that is and what you expect to do with it.
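For what it's worth, each element of estimators_ is an ordinary fitted decision tree, so one way to look at a single tree (as the question asks) is sklearn.tree.export_graphviz. A minimal sketch, assuming a fitted forest named Forest as in the first answer, and Graphviz installed to render the .dot file:

from sklearn.tree import export_graphviz

# Take any single fitted tree out of the ensemble.
one_tree = Forest.estimators_[0]

# Write it out in Graphviz .dot format; render it from a shell with:
#   dot -Tpng tree_0.dot -o tree_0.png
export_graphviz(one_tree, out_file='tree_0.dot')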

You could rank the individual trees by computing their score on a held-out test set, but I don't know what you would expect to get out of that.
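If you want to try it anyway, here is a minimal sketch, assuming a fitted forest named Forest and a held-out test set X_test/y_test with plain 0/1 labels (the sub-trees predict the internally encoded class indices, so non-numeric labels would need extra care). Keep in mind each tree was fit on a bootstrap sample, so single-tree scores are noisy:

# Score every individual tree on the held-out set, best first.
tree_scores = sorted(
    ((i, tree.score(X_test, y_test)) for i, tree in enumerate(Forest.estimators_)),
    key=lambda pair: pair[1],
    reverse=True)
print('Best single tree: index %d, accuracy %.3f' % tree_scores[0])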

Do you want to prune the forest to make predictions faster, by reducing the number of trees without decreasing the aggregate forest accuracy?
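If so, one hedged way to experiment is to deep-copy the fitted forest and keep only a subset of its trees; prediction then aggregates over the smaller ensemble. Note that estimators_ is an internal attribute rather than a supported pruning API, so treat this as a sketch:

import copy

# Keep only the first 10 trees of the already-fitted forest.
small_forest = copy.deepcopy(Forest)
small_forest.estimators_ = small_forest.estimators_[:10]
small_forest.n_estimators = len(small_forest.estimators_)

# small_forest.predict(...) now aggregates over 10 trees instead of 100.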

Answer 2: (score: 1)

Here is how I visualize the tree:

First, make the model after doing all of your preprocessing, splitting, etc.:

# number of trees = 100
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

Make the predictions:

# Predicting the Test set results
y_pred = classifier.predict(X_test)

Then make the importance plot. The variable dataset is the name of the original dataframe.

import numpy as np
import matplotlib.pyplot as plt

# get importances from the RF
importances = classifier.feature_importances_

# argsort sorts ascending, so the most important feature ends up
# at the top of the horizontal bar chart
indices = np.argsort(importances)

# get the feature names from the original data set
# (the first 26 columns are the feature columns in this dataframe)
features = dataset.columns[0:26]

# plot them with a horizontal bar chart
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()

This produces a plot like the following:

[Figure: horizontal bar chart of relative feature importances]