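I am trying to build a text classification model on a database of site reviews (3 classes). I cleaned the DataFrame, tokenized it (with CountVectorizer), applied TF-IDF (TfidfTransformer), and built an MNB model. Now that I have trained and evaluated the model, I want to get a list of the wrong predictions so I can pass them through LIME and explore the words that confuse the model.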
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
classification_report,
confusion_matrix,
accuracy_score,
roc_auc_score,
roc_curve,
)
df = pd.read_csv(
"https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)
x = cleaned_df["review_text"]
y = cleaned_df["business_category"]
# tokenization
vectorizer = CountVectorizer()
vectorizer_fit = vectorizer.fit(x)
bow_x = vectorizer_fit.transform(x)
#### transform BOW to TF-IDF
transformer = TfidfTransformer()
transformer_x = transformer.fit(bow_x)
tfidf_x = transformer_x.transform(bow_x)
# SPLITTING THE DATASET INTO TRAINING SET AND TESTING SET
x_train, x_test, y_train, y_test = train_test_split(
tfidf_x, y, test_size=0.3, random_state=101
)
mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train, y_train)
predmnb = mnb.predict(x_test)
My goal is to get the original indices of the reviews that the model predicted incorrectly.
Answer 0 (score: 0)
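There seems to be another problem in your code: normally the TF-IDF vectorizer is fit on the training data only, and the test data is brought into the same format with a transform call. This is mainly done to avoid data leakage. See "TfidfVectorizer: should it be used on train only or train+test". I have modified your code to do what you need.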
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
classification_report,
confusion_matrix,
accuracy_score,
roc_auc_score,
roc_curve,
)
df = pd.read_csv(
"https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)
x = cleaned_df["review_text"]
y = cleaned_df["business_category"]
# SPLITTING THE DATASET INTO TRAINING SET AND TESTING SET
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.3, random_state=101
)
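# TfidfTransformer expects a count matrix, not raw text, so vectorize first.
# Both objects are fit on the training text only and reused for the test text.
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
x_train_tf = transformer.fit_transform(vectorizer.fit_transform(x_train))
x_test_tf = transformer.transform(vectorizer.transform(x_test))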
mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train_tf, y_train)
predmnb = mnb.predict(x_test_tf)
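# Keep the test reviews where the prediction differs from the true label;
# incorrect_docs.index holds their original row labels from cleaned_df.
incorrect_docs = x_test[predmnb != y_test]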
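To see why the boolean mask gives back the original row labels, here is a minimal sketch on made-up data. The toy reviews, the category names, and the custom index starting at 100 are all assumptions for illustration, not part of the real dataset. Because `train_test_split` keeps the pandas index of each row, comparing the predictions with `y_test` yields a mask that is already aligned with the original frame.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up reviews with a non-default index (100..111) so it is obvious
# that the original row labels survive the split.
toy = pd.DataFrame(
    {
        "review_text": [
            "great pizza and pasta", "tasty burger and fries",
            "delicious sushi dinner", "awful soup cold pasta",
            "friendly dentist clean office", "doctor was kind and quick",
            "long wait at the clinic", "nurse explained the treatment",
            "haircut was perfect", "nice manicure and massage",
            "stylist ruined my hair color", "relaxing spa and facial",
        ],
        "business_category": ["food"] * 4 + ["health"] * 4 + ["beauty"] * 4,
    },
    index=range(100, 112),
)

x_train, x_test, y_train, y_test = train_test_split(
    toy["review_text"], toy["business_category"], test_size=0.25, random_state=0
)

# Fit on the training text only, then transform the test text.
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
x_train_tf = transformer.fit_transform(vectorizer.fit_transform(x_train))
x_test_tf = transformer.transform(vectorizer.transform(x_test))

model = MultinomialNB().fit(x_train_tf, y_train)

# Wrap the predictions in a Series that shares y_test's index.
pred = pd.Series(model.predict(x_test_tf), index=y_test.index)
wrong_mask = pred != y_test

# Original row labels of the misclassified reviews, plus the full rows.
wrong_idx = y_test.index[wrong_mask]
errors = toy.loc[wrong_idx].assign(prediction=pred[wrong_mask])
print(errors)
```

The `errors` frame can then be fed row by row to an explainer such as LIME, since each row still carries the raw `review_text`.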
Hope this helps!
Answer 1 (score: 0)
I managed to get the result like this:
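# c is assumed here to be an already fitted model or pipeline that accepts raw text
predictions = c.predict(preprocessed_df['review_text'])
# Reuse preprocessed_df's index so the join lines up row by row
df2 = preprocessed_df.join(pd.DataFrame(predictions, index=preprocessed_df.index))
df2.columns = ['review_text', 'business_category', 'word_count', 'prediction']
# Rows where the true category and the prediction disagree
df2[df2['business_category'] != df2['prediction']]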
I'm sure there is a more elegant way...
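One possibly tidier variant, sketched under assumptions: wrap the vectorizer and classifier in a scikit-learn pipeline, split the DataFrame itself, and attach the predictions with `DataFrame.assign`, which avoids the join and the manual column renaming. The toy reviews, the `r0`..`r8` index labels, and the names `clf`, `scored`, and `mistakes` are hypothetical and only serve the illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up reviews with string row labels standing in for the real index.
reviews = pd.DataFrame(
    {
        "review_text": [
            "great pizza and pasta", "tasty burger and fries", "cold soup",
            "friendly dentist", "doctor was quick", "long wait at the clinic",
            "haircut was perfect", "nice manicure", "relaxing spa and facial",
        ],
        "business_category": ["food"] * 3 + ["health"] * 3 + ["beauty"] * 3,
    },
    index=[f"r{i}" for i in range(9)],
)

# Splitting the whole frame keeps text, label, and index together.
train_df, test_df = train_test_split(reviews, test_size=3, random_state=0)

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step;
# inside the pipeline it is fit on the training text only.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_df["review_text"], train_df["business_category"])

# assign() returns a new frame that keeps test_df's original index.
scored = test_df.assign(prediction=clf.predict(test_df["review_text"]))
mistakes = scored[scored["business_category"] != scored["prediction"]]
print(mistakes)
```

Since the pipeline takes raw text and exposes `predict_proba`, the same `clf` object can usually be handed to a text explainer directly.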