Question

I am using the scikit-learn extra trees classifier:

model = ExtraTreesClassifier(n_estimators=10000, n_jobs=-1, random_state=0)

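Once the model is fitted and used to predict classes, I would like to find out the contribution of each feature to a specific class prediction. How can I do this in scikit-learn? Is it possible with the extra trees classifier, or do I need to use some other model?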

Answer 1

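Update

Knowing more about ML today than I did 2.5 years ago, I would now say that this approach only works for highly linear decision problems. If you carelessly apply it to a non-linear problem, you will run into trouble.

Example: Imagine a feature where neither very large nor very small values predict a class, but values in some intermediate interval do. That could be water intake used to predict dehydration. But water intake probably interacts with salt intake, since eating more salt allows for a greater water intake. Now you have an interaction between two non-linear features. The decision boundary meanders around your feature space to model this non-linearity, and asking only how much one of the features influences the risk of dehydration is simply ignorant. It is not the right question.

Alternative: Another, more meaningful question you could ask is: if I did not have this information (if I left out this feature), how much would my prediction of a given label suffer? To answer it, you simply leave out one feature, train a model, and look at how much precision and recall drop for each of your classes. This still tells you about feature importance, but it makes no assumptions about linearity.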
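A minimal sketch of that leave-one-feature-out comparison. The iris data, the 100-tree ExtraTreesClassifier, and the helper name per_class_scores are assumptions made for illustration; none of them come from the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=0, stratify=iris.target)


def per_class_scores(columns):
    """Train on the given feature columns; return per-class precision and recall."""
    model = ExtraTreesClassifier(n_estimators=100, random_state=0)
    model.fit(X_train[:, columns], y_train)
    predicted = model.predict(X_test[:, columns])
    precision, recall, _, _ = precision_recall_fscore_support(
        y_test, predicted, labels=[0, 1, 2], zero_division=0)
    return precision, recall


all_columns = list(range(iris.data.shape[1]))
base_precision, base_recall = per_class_scores(all_columns)

drops = {}
for left_out, name in enumerate(iris.feature_names):
    kept = [c for c in all_columns if c != left_out]
    precision, recall = per_class_scores(kept)
    # Positive numbers mean the score for that class got worse without the feature.
    drops[name] = {"precision_drop": base_precision - precision,
                   "recall_drop": base_recall - recall}
    print(name, drops[name])
```

Each entry holds one value per class, so you can read off which class suffers most when a given feature is missing. On a dataset with strongly correlated features the drops may be small, because the remaining features can stand in for the missing one.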

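Below is the old answer.

A while back I faced a similar problem and posted the same question on Cross Validated. The short answer is that there is no implementation in sklearn that does everything you want.

However, what you are trying to achieve is quite simple: multiply the standardised mean value of each feature within each class by the corresponding element of the model.feature_importances_ array. You can write a simple function that standardises your dataset, computes the mean of each feature split by predicted class, and does element-wise multiplication with the model.feature_importances_ array. The greater the absolute value of the result, the more important the feature is to its predicted class. Better yet, the sign tells you whether it is small or large values that matter.

Here is a super simple implementation that takes a data matrix X, a list of predictions Y and an array of feature importances, and outputs a JSON describing the importance of each feature to each class.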

def class_feature_importance(X, Y, feature_importances):
    N, M = X.shape  # N samples, M features
    X = scale(X)    # standardise every feature column
    Y = np.asarray(Y)

    out = {}
    # .tolist() yields native Python labels, so json can serialise the keys
    for c in np.unique(Y).tolist():
        # mean standardised value of each feature within class c,
        # weighted by that feature's overall importance
        weighted = np.mean(X[Y == c, :], axis=0) * feature_importances
        out[c] = dict(zip(range(M), weighted.tolist()))

    return out

Example:

import numpy as np
import json
from sklearn.preprocessing import scale

X = np.array([[ 2,  2,  2,  0,  3, -1],
              [ 2,  1,  2, -1,  2,  1],
              [ 0, -3,  0,  1, -2,  0],
              [-1, -1,  1,  1, -1, -1],
              [-1,  0,  0,  2, -3,  1],
              [ 2,  2,  2,  0,  3,  0]], dtype=float)

Y = np.array([0, 0, 1, 1, 1, 0])
feature_importances = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])
#feature_importances = model.feature_importances_

result = class_feature_importance(X, Y, feature_importances)

print(json.dumps(result, indent=4))

{
    "0": {
        "0": 0.097014250014533204, 
        "1": 0.16932975630904751, 
        "2": 0.27854300726557774, 
        "3": -0.17407765595569782, 
        "4": 0.0961523947640823, 
        "5": 0.0
    }, 
    "1": {
        "0": -0.097014250014533177, 
        "1": -0.16932975630904754, 
        "2": -0.27854300726557779, 
        "3": 0.17407765595569782, 
        "4": -0.0961523947640823, 
        "5": 0.0
    }
}

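The first level of keys in result are the class labels, and the second level of keys are the column indices, i.e. the feature indices. Recall that a large absolute value corresponds to importance, and the sign tells you whether it is small (possibly negative) or large values that matter.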
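To connect this to a fitted model, the same computation can be fed with model.feature_importances_ and the model's own predictions. The sketch below assumes the iris data and a 100-tree ExtraTreesClassifier as stand-ins, and inlines the computation rather than reusing the function above so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import scale

iris = load_iris()
X, y = iris.data, iris.target

model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
predicted = model.predict(X)

# Standardise the features, then weight each per-class mean by the
# model's global importance for that feature.
X_std = scale(X)
signed_importance = {}
for c in np.unique(predicted).tolist():
    weighted = X_std[predicted == c].mean(axis=0) * model.feature_importances_
    signed_importance[c] = dict(zip(iris.feature_names, weighted.tolist()))

for c, scores in signed_importance.items():
    print(iris.target_names[c], scores)
```

For setosa the petal scores come out negative, which matches the reading described above: small petal values are what characterise that class. The caveat from the update still applies, since this only makes sense when the relationship is roughly linear.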

Answer 2

This is modified from the docs

from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

iris = datasets.load_iris()  #sample data
X, y = iris.data, iris.target

model = ExtraTreesClassifier(n_estimators=10000, n_jobs=-1, random_state=0)
model.fit(X, y)  # fit the model to the dataset

I think feature_importances_ is what you are looking for:

In [13]: model.feature_importances_
Out[13]: array([ 0.09523045,  0.05767901,  0.40150422,  0.44558631])
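Note that feature_importances_ is a single global vector: one score per feature, normalised to sum to 1, with no per-class breakdown. A small sketch that pairs the scores with feature names, assuming the iris data and a smaller forest (100 trees) so that it runs quickly:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

iris = load_iris()
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(iris.data, iris.target)

importances = model.feature_importances_

# Rank features from most to least important.
ranked = sorted(zip(iris.feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")

print("shape:", importances.shape)  # one entry per feature, not per class
print("sum:", importances.sum())
```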

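Edit

Maybe I misunderstood the first time (pre-bounty), sorry about that; this may be closer to what you are asking for. There is a Python library called treeinterpreter that produces the information I think you are looking for. You will have to use the basic DecisionTreeClassifier (or Regressor). Following along from this blog post, you can access the feature contributions to the prediction of each individual instance: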

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from treeinterpreter import treeinterpreter as ti

iris = datasets.load_iris()  #sample data
X, y = iris.data, iris.target
#split into training and test 
X_train, X_test, y_train, y_test = train_test_split( 
    X, y, test_size=0.33, random_state=0)

# fit the model on the training set
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train,y_train)

For illustration, I will iterate through each sample in X_test; this mimics the blog post above almost exactly:

for test_sample in range(len(X_test)):
    prediction, bias, contributions = ti.predict(model, X_test[test_sample].reshape(1,4))
    print("Class Prediction", prediction)
    print("Bias (trainset prior)", bias)

    # now extract contributions for each instance
    for c, feature in zip(contributions[0], iris.feature_names):
        print(feature, c)

    print('\n')

The first iteration of the loop yields:

Class Prediction [[ 0.  0.  1.]]
Bias (trainset prior) [[ 0.34  0.31  0.35]]
sepal length (cm) [ 0.  0.  0.]
sepal width (cm) [ 0.  0.  0.]
petal length (cm) [ 0.         -0.43939394  0.43939394]
petal width (cm) [-0.34        0.12939394  0.21060606]

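Interpreting this output, it seems that petal length and petal width were the most important contributors to the prediction of the third class (for the first sample). Hope this helps.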
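The output above follows the pattern prediction = bias + sum of feature contributions. As a rough illustration of where such numbers can come from, here is a sketch that walks the decision path of a single tree using only scikit-learn. It is not treeinterpreter's actual code; the helper name path_contributions is made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)


def path_contributions(model, x):
    """Split one prediction into a bias term and per-feature contributions."""
    tree = model.tree_
    # Class distribution at every node, normalised per node so that it
    # works whether tree_.value stores counts or fractions.
    values = tree.value[:, 0, :]
    values = values / values.sum(axis=1, keepdims=True)
    # Ids of the nodes that x passes through, from the root to a leaf.
    node_ids = model.decision_path(x.reshape(1, -1)).indices
    bias = values[node_ids[0]]  # class distribution at the root
    contributions = np.zeros((x.shape[0], values.shape[1]))
    for parent, child in zip(node_ids[:-1], node_ids[1:]):
        # Credit the change in class distribution to the feature that the
        # parent node split on.
        contributions[tree.feature[parent]] += values[child] - values[parent]
    return bias, contributions


x = X_test[0]
bias, contributions = path_contributions(model, x)
print("Bias (trainset prior)", bias)
for name, c in zip(iris.feature_names, contributions):
    print(name, c)
```

Adding the bias to the column sums of the contributions recovers model.predict_proba for that sample, which is a quick way to check the decomposition.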

Answer 3

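The paper "Why Should I Trust You?": Explaining the Predictions of Any Classifier was submitted 9 days after this question was posted, and it provides an algorithm for a general solution to this problem! :-)

In short, it is called LIME, for "local interpretable model-agnostic explanations", and it works by fitting a simpler, local model around the prediction(s) you want to understand.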
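To make "fitting a simpler, local model around the prediction" concrete, here is a rough sketch of the idea using only scikit-learn and NumPy. It is not the lime library and does not reproduce its sampling or kernel; the function name, the Gaussian perturbation, the kernel width and the ridge surrogate are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import Ridge

iris = load_iris()
X, y = iris.data, iris.target
model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)


def local_linear_explanation(model, X, x, class_index,
                             n_samples=1000, kernel_width=1.5, seed=0):
    """Fit a proximity-weighted linear surrogate around one instance x."""
    rng = np.random.default_rng(seed)
    feature_scale = X.std(axis=0)
    # 1. Perturb the instance with per-feature Gaussian noise.
    Z = x + rng.normal(size=(n_samples, X.shape[1])) * feature_scale
    # 2. Ask the black-box model for the probability of the class of interest.
    p = model.predict_proba(Z)[:, class_index]
    # 3. Weight each perturbed point by its closeness to x.
    Z_std = (Z - x) / feature_scale
    distance = np.linalg.norm(Z_std, axis=1)
    weights = np.exp(-(distance ** 2) / (2 * kernel_width ** 2))
    # 4. Fit the simple, local model; its coefficients are the explanation.
    surrogate = Ridge(alpha=1.0).fit(Z_std, p, sample_weight=weights)
    return surrogate.coef_


coefs = local_linear_explanation(model, X, X[0], class_index=0)
for name, coef in zip(iris.feature_names, coefs):
    print(f"{name}: {coef:+.3f}")
```

A positive coefficient means that, near this particular sample, increasing the feature pushes the model towards the chosen class. The signs and sizes are only valid locally and can differ from one sample to the next.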

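What is more, they have made a Python implementation (https://github.com/marcotcr/lime) with fairly detailed examples of how to use it with sklearn. For instance, this one covers a two-class random forest problem on text data, and this one covers continuous and categorical features. They can all be found via the README on github.

The authors had a very productive year in 2016 in this field, so if you like reading papers, here is a starter: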

Answer 4

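So far I have looked at eli5 and treeinterpreter (both mentioned before), and I think eli5 will be the most helpful, because it seems to have more options and to be more generic and more up to date.

Nevertheless, when I applied eli5 to a particular case, I could not obtain negative contributions for ExtraTreesClassifier. After researching a little more, I realised that I was obtaining the importance, or weight, as seen here. I was more interested in something like a contribution, as mentioned in the title of this question. I understand that some features can have a negative effect, but when measuring importance the sign does not matter, so features with positive effects and features with negative effects are lumped together.

Because I was very interested in the sign, I did the following: 1) obtain the contributions for all cases; 2) aggregate all the results so that positive and negative effects can be told apart. It is not a very elegant solution, and there is probably something better out there, but I am posting it here in case it helps.

I reproduce the setup from the previous post.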

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import  (ExtraTreesClassifier, RandomForestClassifier, 
                              AdaBoostClassifier, GradientBoostingClassifier)
import eli5


iris = datasets.load_iris()  #sample data
X, y = iris.data, iris.target
#split into training and test 
X_train, X_test, y_train, y_test = train_test_split( 
    X, y, test_size=0.33, random_state=0)

# fit the model on the training set
#model = DecisionTreeClassifier(random_state=0)
model = ExtraTreesClassifier(n_estimators= 100)

model.fit(X_train,y_train)


aux1 = eli5.sklearn.explain_prediction.explain_prediction_tree_classifier(model,X[0], top=X.shape[1])

aux1

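Output

The previous result covers a single case. I want to run all of them and compute an average.

This is what the dataframe with the results looks like: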

aux1 = eli5.sklearn.explain_prediction.explain_prediction_tree_classifier(model,X[0], top=X.shape[0])
aux1 = eli5.format_as_dataframe(aux1)
# aux1.index = aux1['feature']
# del aux1['target']
aux1


    target feature    weight  value
0        0  <BIAS>  0.340000    1.0
1        0      x3  0.285764    0.2
2        0      x2  0.267080    1.4
3        0      x1  0.058208    3.5
4        0      x0  0.048949    5.1
5        1  <BIAS>  0.310000    1.0
6        1      x0 -0.004606    5.1
7        1      x1 -0.048211    3.5
8        1      x2 -0.111974    1.4
9        1      x3 -0.145209    0.2
10       2  <BIAS>  0.350000    1.0
11       2      x1 -0.009997    3.5
12       2      x0 -0.044343    5.1
13       2      x3 -0.140554    0.2
14       2      x2 -0.155106    1.4

So I created a function to combine tables like the previous one:

def concat_average_dfs(aux2, aux3):
    # Put both frames on the same (feature, target) index.
    # The try/except is there because this function is called repeatedly and
    # may receive a dataframe that already has that index. This is not the
    # best way.
    try:
        aux2.set_index(['feature', 'target'], inplace=True)
    except KeyError:
        pass
    try:
        aux3.set_index(['feature', 'target'], inplace=True)
    except KeyError:
        pass
    # Concatenate the weights and take the mean per (feature, target)
    aux = pd.DataFrame(pd.concat([aux2['weight'], aux3['weight']]).groupby(level=[0, 1]).mean())
    return aux

aux2 = aux1.copy(deep=True)
aux3 = aux1.copy(deep=True)

concat_average_dfs(aux3, aux2)

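So now I only have to apply the previous function to all the examples I want. I will use the whole population, not just the training set, to check the mean effect across all real cases.

for i in range(X.shape[0]):
    aux1 = eli5.sklearn.explain_prediction.explain_prediction_tree_classifier(model, X[i], top=X.shape[0])
    aux1 = eli5.format_as_dataframe(aux1)

    if 'aux_total' in locals() and 'aux_total' in globals():
        # averages the new frame with the running result
        aux_total = concat_average_dfs(aux1, aux_total)
    else:
        aux_total = aux1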

Result:

The last table shows the average effect of each feature across my whole real population.
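One thing to be aware of: the loop above averages each new frame with the running result, so later samples weigh more than earlier ones. If every sample should count equally, one option is to collect all the per-sample frames first and take a single grouped mean. The sketch below uses small hand-made frames as hypothetical stand-ins for the output of eli5.format_as_dataframe, so that it runs without eli5.

```python
import pandas as pd

# Hypothetical per-sample explanation frames with the same columns as the
# eli5 dataframe shown earlier (only the columns needed here).
sample_frames = [
    pd.DataFrame({"target": [0, 0, 1, 1],
                  "feature": ["x0", "x1", "x0", "x1"],
                  "weight": [0.25, -0.5, -0.25, 0.5]}),
    pd.DataFrame({"target": [0, 0, 1, 1],
                  "feature": ["x0", "x1", "x0", "x1"],
                  "weight": [0.75, 0.5, -0.75, -0.5]}),
    pd.DataFrame({"target": [0, 0, 1, 1],
                  "feature": ["x0", "x1", "x0", "x1"],
                  "weight": [0.5, 0.0, -0.5, 0.0]}),
]

# Stack every sample's rows, then average once per (feature, target) pair
# so that each sample contributes with equal weight.
all_rows = pd.concat(sample_frames, ignore_index=True)
mean_contribution = (
    all_rows.groupby(["feature", "target"])["weight"]
    .mean()
    .unstack("target")  # rows: features, columns: target classes
)
print(mean_contribution)
```

In the real loop you would append each eli5.format_as_dataframe result to a list instead of calling concat_average_dfs, and run the concat and groupby once at the end.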

The companion notebook is on my github.

Answer 5

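As @thorbjornwolf has shown, there is a method called LIME (including a Python library) that addresses this problem. Another library that addresses it is SHAP, named after Shapley values. Both libraries look viable and offer a complete solution to this problem.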

Determining the contribution of each feature to a specific class prediction using scikit
