Question

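I'm trying to build a text classification model on a database of website reviews (3 classes). I cleaned the DataFrame, tokenized it (with CountVectorizer), applied TF-IDF (TfidfTransformer), and built an MNB model. Now that I have trained and evaluated the model, I want to get a list of the wrong predictions so I can pass them through LIME and explore the words that confuse the model.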

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    roc_curve,
)

df = pd.read_csv(
    "https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
    labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)

x = cleaned_df["review_text"]
y = cleaned_df["business_category"]

# tokenization
vectorizer = CountVectorizer()
vectorizer_fit = vectorizer.fit(x)
bow_x = vectorizer_fit.transform(x)

#### transform BOW to TF-IDF
transformer = TfidfTransformer()
transformer_x = transformer.fit(bow_x)
tfidf_x = transformer_x.transform(bow_x)

# SPLITTING THE DATASET INTO TRAINING SET AND TESTING SET
x_train, x_test, y_train, y_test = train_test_split(
    tfidf_x, y, test_size=0.3, random_state=101
)

mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train, y_train)

predmnb = mnb.predict(x_test)

My goal is to get the original indices of the reviews that the model predicted incorrectly.

Answer 1

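There appears to be another issue in your code: the TF-IDF vectorizer should normally be fit on the training data only, and the test data is then brought into the same format with a transform call. This is mainly to avoid data leakage. See "TfidfVectorizer: should it be used on train only or train+test". I have modified your code to meet your needs.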

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    roc_curve,
)

df = pd.read_csv(
    "https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
    labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)

x = cleaned_df["review_text"]
y = cleaned_df["business_category"]

# SPLITTING THE DATASET INTO TRAINING SET AND TESTING SET
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=101
)


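# Fit the bag-of-words and TF-IDF steps on the training split only,
# then reuse the fitted objects to transform the test split
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
x_train_tf = transformer.fit_transform(vectorizer.fit_transform(x_train))
x_test_tf = transformer.transform(vectorizer.transform(x_test))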



mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train_tf, y_train)

predmnb = mnb.predict(x_test_tf)
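# Keep the test reviews whose prediction differs from the true label;
# x_test is a pandas Series, so the original DataFrame index is preserved
incorrect_docs = x_test[predmnb != y_test]
incorrect_idx = incorrect_docs.index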
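
For reference, here is a minimal self-contained sketch of the same idea. It is not the exact code above: it assumes a small made-up DataFrame in place of train3.csv (the review texts and category names are invented), and it swaps the separate CountVectorizer and TfidfTransformer steps for a scikit-learn pipeline built from TfidfVectorizer, so the vectorizer is fit on the training split only. The variable names `toy`, `pipe`, `wrong_idx`, and `wrong_reviews` are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for train3.csv (texts and labels are made up)
toy = pd.DataFrame(
    {
        "review_text": [
            "great pizza and friendly waiters",
            "the pasta was cold and bland",
            "cocktails were strong and cheap",
            "loud music and good beer on tap",
            "clean room and comfortable bed",
            "the hotel lobby was dirty",
            "best burger in town",
            "nice bartender and tasty drinks",
            "late checkout and quiet room",
        ],
        "business_category": [
            "Restaurant", "Restaurant", "Bar", "Bar", "Hotel",
            "Hotel", "Restaurant", "Bar", "Hotel",
        ],
    }
)

# Split the raw text, not the vectorized matrix, so the pandas index survives
x_train, x_test, y_train, y_test = train_test_split(
    toy["review_text"], toy["business_category"], test_size=0.3, random_state=101
)

# The pipeline fits the vectorizer on the training split only
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.14))
pipe.fit(x_train, y_train)

# Attach the test index to the predictions so they line up with y_test
pred = pd.Series(pipe.predict(x_test), index=x_test.index, name="prediction")
wrong_mask = pred != y_test

# Original row labels of the misclassified reviews, plus the rows themselves
wrong_idx = x_test.index[wrong_mask.to_numpy()]
wrong_reviews = toy.loc[wrong_idx].assign(prediction=pred[wrong_mask])
print(wrong_reviews)
```

Because `pipe` accepts raw strings, its `predict_proba` method can be called directly on review text, which is convenient if the misclassified reviews are later passed to an explainer that expects a text-in, probabilities-out function.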

Hope this helps!

Answer 2

I managed to get the result like this:

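# `c` is the fitted model used in this answer (its definition is not shown here)
predictions = c.predict(preprocessed_df['review_text'])
# join aligns on the index, so this assumes preprocessed_df has a default 0..n-1 index
df2 = preprocessed_df.join(pd.DataFrame(predictions))
df2.columns = ['review_text', 'business_category', 'word_count', 'prediction']
# Rows where the true category and the prediction disagree
df2[df2['business_category'] != df2['prediction']]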

I'm sure there is a more elegant way...
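
One possibly tidier variant, offered only as a sketch: attach the predictions with `DataFrame.assign`, which adds the column positionally and avoids both the index-aligned join and the manual column renaming. The data, the non-default index, and the hard-coded `predictions` list below are all made up for illustration; in practice the list would come from the fitted model's `predict` call.

```python
import pandas as pd

# Hypothetical stand-in for preprocessed_df, with a non-default index on purpose
preprocessed_df = pd.DataFrame(
    {
        "review_text": ["good pizza", "cheap beer", "clean room", "great cocktails"],
        "business_category": ["Restaurant", "Bar", "Hotel", "Bar"],
    },
    index=[10, 11, 12, 13],
)

# Placeholder for the model output, e.g. c.predict(preprocessed_df["review_text"])
predictions = ["Restaurant", "Hotel", "Hotel", "Restaurant"]

# assign() adds the list positionally, so the original index is left untouched
df2 = preprocessed_df.assign(prediction=predictions)

# Keep only the rows where the label and the prediction disagree
wrong = df2[df2["business_category"] != df2["prediction"]]
wrong_index = wrong.index.tolist()
print(wrong_index)  # [11, 13]
```

The index of `wrong` is the index of the original rows, so it can be used directly to look the reviews up again.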

How to get a list of the incorrect predictions on the validation set

2 answers: