SHAP KernelExplainer error on text data when using a pipeline

Asked: 2018-11-07 05:19:32

Tags: python pipe pipeline

I have been going through Python's SHAP package and could not find an example of using KernelExplainer to explain predictions on text data, so I decided to try it out on a dataset I found at https://www.superdatascience.com/machine-learning/.

I ran into a problem in the last part, the KernelExplainer, and I think the issue is the way I am feeding the data and the model into the explainer.

AttributeError: 'numpy.ndarray' object has no attribute 'lower'
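
From what I can tell, the error appears as soon as the explainer calls the pipeline: KernelExplainer hands predict_proba a 2-D numpy array of (perturbed) background rows, so TfidfVectorizer ends up calling .lower() on an ndarray instead of on a string. The snippet below is only my own minimal reproduction of that message, using the fitted rf_pipe from the code further down; it is not part of my actual script:

import numpy as np

# Each "row" the explainer passes is itself an ndarray, not a plain string,
# which is what trips up TfidfVectorizer's preprocessing step.
rf_pipe.predict_proba(np.array([['good food']], dtype=object))
# AttributeError: 'numpy.ndarray' object has no attribute 'lower'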

Can someone suggest how I should modify my code so that the explainer works? I have spent several hours on this last part, but to no avail. Any help or suggestions would be greatly appreciated. Thank you very much!

Dataset: https://drive.google.com/file/d/1-pzY7IQVyB_GmT5dT0yRx3hYzOFGrZSr/view?usp=sharing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import re
import nltk

#Load the data
os.chdir('C:\\Users\\Win\\Desktop\\MyLearning\\Explainability\\SHAP')
review = pd.read_csv('Restaurant_Reviews.tsv', sep='\t')

#Clean the data
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def clean_text(df_text_column, data):   
    corpus = []
    for i in range(0, len(data)):
        text = re.sub('[^a-zA-Z]', ' ', df_text_column[i])
        text = text.lower()
        text = text.split()
        ps = PorterStemmer()
        text = [ps.stem(word) for word in text if not word in set(stopwords.words('english'))]
        text = ' '.join(text)
        corpus.append(text)
    return corpus

X = pd.DataFrame({'Review':clean_text(review['Review'],review)})['Review']
y = review['Liked']

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Creating the pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer() 
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
from sklearn.pipeline import make_pipeline
np.random.seed(0)
rf_pipe = make_pipeline(vect, rf)
rf_pipe.steps
rf_pipe.fit(X_train, y_train)

y_pred = rf_pipe.predict(X_test)
y_prob = rf_pipe.predict_proba(X_test)

#Performance Metrics
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred) #Accuracy
metrics.roc_auc_score(y_test, y_prob[:, 1]) #ROC-AUC score

# use Kernel SHAP to explain test set predictions
import shap
explainer = shap.KernelExplainer(rf_pipe.predict_proba, X_train, link="logit")
shap_values = explainer.shap_values(X_test, nsamples=100)

# plot the SHAP values
shap.force_plot(explainer.expected_value[0], shap_values[0][0,:], X_test.iloc[0,:], link="logit")
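
One idea I have been experimenting with (just a rough sketch, I have not confirmed it is the right approach) is to keep the TfidfVectorizer outside of what KernelExplainer sees: vectorize the text first, fit the random forest on the TF-IDF matrix, and explain rf.predict_proba on that numeric matrix, with the background summarized via shap.kmeans to keep the runtime down. It reuses vect, rf and the train/test split from above:

# Sketch of a possible workaround: explain the classifier on the TF-IDF
# features so the explainer never passes raw numpy arrays to the vectorizer.
X_train_vect = vect.fit_transform(X_train).toarray()
X_test_vect = vect.transform(X_test).toarray()
rf.fit(X_train_vect, y_train)

# Summarize the background data with k-means so Kernel SHAP stays tractable
background = shap.kmeans(X_train_vect, 10)
explainer = shap.KernelExplainer(rf.predict_proba, background, link="logit")
shap_values = explainer.shap_values(X_test_vect[0:1], nsamples=100)

# Force plot for the first test review (class 1 = "Liked")
shap.force_plot(explainer.expected_value[1], shap_values[1][0, :],
                features=X_test_vect[0, :],
                feature_names=vect.get_feature_names(),
                link="logit")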

0 Answers:

No answers yet.