Question

我正在使用sklearn的线性SVC模型进行文本分类。现在，我想通过使用SHAP（https://github.com/slundberg/shap）来可视化哪些单词/标记对分类决策的影响最大。

现在这不起作用，因为我收到一个错误，该错误似乎源于我定义的管道中的矢量化器步骤-这是怎么回事？

在这种情况下，我关于使用SHAP的一般方法是否正确？

x_Train, x_Test, y_Train, y_Test = train_test_split(df_all['PDFText'], df_all['class'], test_size = 0.2, random_state = 1234)

pipeline = Pipeline([
    (
        'tfidv',
        TfidfVectorizer(
            ngram_range=(1,3), 
            analyzer='word',
            strip_accents = ascii,
            use_idf = True,
            sublinear_tf=True, 
            max_features=6000, 
            min_df=2, 
            max_df=1.0
        )
    ),
    (
        'lin_svc',
        svm.SVC(
            C=1.0,
            probability=True,
            kernel='linear'
        )
    )
])

pipeline.fit(x_Train, y_Train)

shap.initjs()

explainer = shap.KernelExplainer(pipeline.predict_proba, x_Train)
shap_values = explainer.shap_values(x_Test, nsamples=100)

shap.force_plot(explainer.expected_value[0], shap_values[0][0,:], x_Test.iloc[0,:])

这是我收到的错误消息：

Provided model function fails when applied to the provided data set.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-81-4bca63616b3b> in <module>
      3 
      4 # use Kernel SHAP to explain test set predictions
----> 5 explainer = shap.KernelExplainer(pipeline.predict_proba, x_Train)
      6 shap_values = explainer.shap_values(x_Test, nsamples=100)
      7 

c:\users\stefan.prenner\appdata\local\programs\python\python37\lib\site-packages\shap\explainers\kernel.py in __init__(self, model, data, link, **kwargs)
     95         self.keep_index_ordered = kwargs.get("keep_index_ordered", False)
     96         self.data = convert_to_data(data, keep_index=self.keep_index)
---> 97         model_null = match_model_to_data(self.model, self.data)
     98 
     99         # enforce our current input type limitations

c:\users\stefan.prenner\appdata\local\programs\python\python37\lib\site-packages\shap\common.py in match_model_to_data(model, data)
     80             out_val = model.f(data.convert_to_df())
     81         else:
---> 82             out_val = model.f(data.data)
     83     except:
     84         print("Provided model function fails when applied to the provided data set.")

c:\users\stefan.prenner\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
    116 
    117         # lambda, but not partial, allows help() to work with update_wrapper
--> 118         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    119         # update the docstring of the returned function
    120         update_wrapper(out, self.fn)

c:\users\stefan.prenner\appdata\local\programs\python\python37\lib\site-packages\sklearn\pipeline.py in predict_proba(self, X)
    379         for name, transform in self.steps[:-1]:
    380             if transform is not None:
--> 381                 Xt = transform.transform(Xt)
    382         return self.steps[-1][-1].predict_proba(Xt)
    383 

c:\users\stefan.prenner\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in transform(self, raw_documents, copy)
   1631         check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')
   1632 
-> 1633         X = super(TfidfVectorizer, self).transform(raw_documents)
   1634         return self._tfidf.transform(X, copy=False)

c:\users\stefan.prenner\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in transform(self, raw_documents)
   1084 
   1085         # use the same matrix-building strategy as fit_transform
-> 1086         _, X = self._count_vocab(raw_documents, fixed_vocab=True)
   1087         if self.binary:
   1088             X.data.fill(1)

c:\users\stefan.prenner\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
    940         for doc in raw_documents:
    941             feature_counter = {}
--> 942             for feature in analyze(doc):
    943                 try:
    944                     feature_idx = vocabulary[feature]

c:\users\stefan.prenner\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
    326                                                tokenize)
    327             return lambda doc: self._word_ngrams(
--> 328                 tokenize(preprocess(self.decode(doc))), stop_words)
    329 
    330         else:

c:\users\stefan.prenner\appdata\local\programs\python\python37\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(x)
    254 
    255         if self.lowercase:
--> 256             return lambda x: strip_accents(x.lower())
    257         else:
    258             return strip_accents

AttributeError: 'numpy.ndarray' object has no attribute 'lower'

Answer 1

KernelExplainer希望收到一个分类模型作为第一个参数。请检查link之后是否使用带有Shap的Pipeline。

根据您的情况，您可以按以下方式使用管道：

x_Train = pipeline.named_steps['tfidv'].fit_transform(x_Train)
explainer = shap.KernelExplainer(pipeline.named_steps['lin_svc'].predict_proba, x_Train)

如何在sklearn的线性SVC模型中使用SHAP？

1 个答案: