How to get a list of the incorrect predictions on the validation set

Time: 2019-06-25 12:32:16

Tags: python pandas scikit-learn

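I am trying to build a text classification model on a database of website reviews (3 classes). I cleaned the DataFrame, tokenized it (with CountVectorizer), applied TF-IDF weighting (TfidfTransformer), and built an MNB (multinomial naive Bayes) model. Now that I have trained and evaluated the model, I want to get a list of the incorrect predictions so that I can pass them through LIME and explore the words that confuse the model.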

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    roc_curve,
)

df = pd.read_csv(
    "https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
    labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)

x = cleaned_df["review_text"]
y = cleaned_df["business_category"]

# tokenization
vectorizer = CountVectorizer()
vectorizer_fit = vectorizer.fit(x)
bow_x = vectorizer_fit.transform(x)

#### transform BOW to TF-IDF
transformer = TfidfTransformer()
transformer_x = transformer.fit(bow_x)
tfidf_x = transformer_x.transform(bow_x)

# SPLITTING THE DATASET INTO TRAINING SET AND TESTING SET
x_train, x_test, y_train, y_test = train_test_split(
    tfidf_x, y, test_size=0.3, random_state=101
)

mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train, y_train)

predmnb = mnb.predict(x_test)

My goal is to get the original indices of the reviews that the model predicted incorrectly.
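
With the split used above, `x_test` is a sparse matrix and carries no index, but `y_test` is still a pandas Series that keeps the original row labels. A minimal sketch of how those labels could be recovered, assuming a small made-up DataFrame (`toy`) in place of the real CSV and using `TfidfVectorizer` as a shorthand for the CountVectorizer plus TfidfTransformer steps:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the review CSV.
toy = pd.DataFrame(
    {
        "review_text": [
            "great pizza and pasta", "tasty burger and fries",
            "friendly dentist visit", "the doctor was helpful",
            "oil change was quick", "mechanic fixed my brakes",
            "delicious sushi rolls", "nurse was very kind",
            "new tires installed fast", "cold soup and slow service",
        ],
        "business_category": [
            "food", "food", "health", "health", "auto",
            "auto", "food", "health", "auto", "food",
        ],
    }
)

# Vectorize first and split afterwards, as in the question's code.
tfidf_x = TfidfVectorizer().fit_transform(toy["review_text"])
y = toy["business_category"]
x_train, x_test, y_train, y_test = train_test_split(
    tfidf_x, y, test_size=0.3, random_state=101
)

mnb = MultinomialNB(alpha=0.14).fit(x_train, y_train)
pred = mnb.predict(x_test)

# Rows of x_test are in the same order as y_test, so a positional mask
# on the predictions selects the matching original row labels.
wrong_mask = pred != y_test.to_numpy()
wrong_index = y_test.index[wrong_mask]

# Look the misclassified reviews up in the original DataFrame.
wrong_reviews = toy.loc[wrong_index, "review_text"]
```

Here `wrong_index` holds labels from the original DataFrame, so the raw text is available even though the model only ever saw the TF-IDF matrix.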

2 answers:

Answer 0: (score: 0)

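There seems to be another issue in your code: a TF-IDF vectorizer is normally fitted on the training data only, and the test data is then brought into the same format with a transform call. This is done mainly to avoid data leakage. See "TfidfVectorizer: should it be used on train only or train+test". I have modified your code to do what you need.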

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    roc_curve,
)

df = pd.read_csv(
    "https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
    labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)

x = cleaned_df["review_text"]
y = cleaned_df["business_category"]

# SPLITTING THE DATASET INTO TRAINING SET AND TESTING SET
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=101
)


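# TfidfTransformer expects a count matrix, not raw text, so vectorize first.
# Both steps are fitted on the training texts only and then applied to the test texts.
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
x_train_tf = transformer.fit_transform(vectorizer.fit_transform(x_train))
x_test_tf = transformer.transform(vectorizer.transform(x_test))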



mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train_tf, y_train)

predmnb = mnb.predict(x_test_tf)
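# Keep the test reviews whose prediction differs from the true label;
# the resulting Series keeps the original index of the DataFrame.
incorrect_docs = x_test[predmnb != y_test]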
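
The same idea can be sketched with a scikit-learn pipeline, which fits the TF-IDF step on the training texts only and keeps raw text as the model input. This is only an illustration under stated assumptions: the reviews and labels are made up, and `TfidfVectorizer` is used in place of the separate CountVectorizer and TfidfTransformer steps.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews and categories.
reviews = pd.Series(
    [
        "great pizza and pasta", "tasty burger and fries",
        "friendly dentist visit", "the doctor was helpful",
        "oil change was quick", "mechanic fixed my brakes",
        "delicious sushi rolls", "nurse was very kind",
        "new tires installed fast", "cold soup and slow service",
    ],
    name="review_text",
)
labels = pd.Series(
    ["food", "food", "health", "health", "auto",
     "auto", "food", "health", "auto", "food"],
    name="business_category",
)

# Split the raw text so x_test keeps the original pandas index.
x_train, x_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.3, random_state=101
)

# fit() learns the vocabulary and IDF weights from x_train only.
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.14))
model.fit(x_train, y_train)

# Attach the test index to the predictions, then line everything up.
pred = pd.Series(model.predict(x_test), index=x_test.index, name="prediction")
errors = pd.DataFrame(
    {"review_text": x_test, "actual": y_test, "prediction": pred}
)
errors = errors[errors["actual"] != errors["prediction"]]
```

`errors.index` then lists the original row labels of the misclassified reviews. A pipeline like this also exposes `predict_proba` on raw strings, which, as far as I know, is the kind of function LIME's text explainer takes.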

Hope this helps!

Answer 1: (score: 0)

I managed to get the result like this:

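# `c` and `preprocessed_df` are not defined in this post; `c` is presumably a
# fitted model that accepts raw text, and `preprocessed_df` the cleaned DataFrame.
predictions = c.predict(preprocessed_df['review_text'])
# join() aligns on the index, so this assumes a default 0..n-1 index
df2 = preprocessed_df.join(pd.DataFrame(predictions))
df2.columns = ['review_text', 'business_category', 'word_count', 'prediction']
# rows where the true category and the prediction disagree
df2[df2['business_category'] != df2['prediction']]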

I'm sure there is a more elegant way...
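
One possibly tidier variant, sketched under assumptions: `preprocessed_df` is a small made-up frame here, and `c` is assumed to be a pipeline that accepts raw text (neither is defined in the answer). `DataFrame.assign` attaches the predictions positionally, so it does not depend on the index being 0..n-1 and needs no manual column renaming.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for preprocessed_df, with a deliberately non-default index.
preprocessed_df = pd.DataFrame(
    {
        "review_text": [
            "great pizza and pasta", "friendly dentist visit",
            "oil change was quick", "delicious sushi rolls",
            "nurse was very kind", "mechanic fixed my brakes",
        ],
        "business_category": ["food", "health", "auto", "food", "health", "auto"],
    },
    index=[10, 11, 12, 13, 14, 15],
)

# Assumed shape of `c`: a text pipeline. It is fitted on the same rows it
# predicts here purely to show the mechanics, not as an evaluation.
c = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.14))
c.fit(preprocessed_df["review_text"], preprocessed_df["business_category"])

# Add the predictions as a new column; the original index is kept as-is.
df2 = preprocessed_df.assign(prediction=c.predict(preprocessed_df["review_text"]))

# Rows where the true category and the prediction disagree.
wrong = df2[df2["business_category"] != df2["prediction"]]
```

`wrong.index` gives the original row labels directly, and any extra columns such as `word_count` would be carried along unchanged.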