Text classifier always predicts the largest class

Date: 2019-04-02 21:08:27

Tags: python machine-learning scikit-learn

I am trying to build a predictive model from text reviews: given a review's text, I want to predict how many stars (1, 2, 3, 4 or 5) the product received.

I followed the scikit tutorial on text data, but my model always predicts a 5-star rating, giving a 66% success rate.

How can I make sure my model does not simply predict the largest class every time?

Here is the data (700 MB): Movies and TV 5-core (1,697,533 reviews)

And here is the subset I am working with (1 MB): Movies and TV first 1000 rows

I use the first 1000 rows for testing; the predictions get worse as I add more rows — with 10,000 rows the score drops to 0.6.

Rating distribution of the first 1000 rows:

5    678
4    133
1     70
3     69
2     50
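Given this distribution, a model that always answers 5 already gets 678/1000 = 67.8% right, close to the reported 0.66. That baseline can be reproduced with scikit-learn's `DummyClassifier` (the arrays below just mimic the distribution above, not the real data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels mimicking the reported distribution of the first 1000 rows
y = np.array([5] * 678 + [4] * 133 + [1] * 70 + [3] * 69 + [2] * 50)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# "most_frequent" always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))  # 0.678
```

Any real model should be judged against this baseline, not against 0% accuracy.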

Here is my code:

import pandas as pd
import numpy as np

# Select columns (data is the reviews DataFrame loaded earlier,
# e.g. via pd.read_json('reviews.json', lines=True))
df = data[['reviewText','overall']]

# Make a smaller set while creating model

df_small = df.head(1000)

# Train test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_small[['reviewText']], df_small[['overall']], 
    test_size=0.1, random_state=42)

X_train = X_train.values.ravel() # https://stackoverflow.com/a/26367429
X_test = X_test.values.ravel()
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

# https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train) 

from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)  # tf-only features (not used below)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Fit

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)

# Test

docs_new = X_test
X_new_counts = vectorizer.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

np.mean(predicted == y_test)  

Output: 0.66
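A quick way to confirm that the classifier has collapsed to the majority class is to look at the distribution of its predictions (sketched here with a hypothetical `predicted` array, not the real output):

```python
import numpy as np

# Hypothetical predictions from a classifier stuck on one class
predicted = np.array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5])
labels, counts = np.unique(predicted, return_counts=True)
print(labels.tolist(), counts.tolist())  # [5] [10]
```

If only one label shows up, accuracy simply equals that class's share of the test set.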

1 answer:

Answer 0 (score: 0)

Have you tried stratified sampling, which splits your classes proportionally between the training and test sets?

Also, try looking at your F1 Score and your ROC AUC Score rather than plain accuracy, which is misleading on imbalanced classes.

from sklearn.model_selection import StratifiedShuffleSplit

# Pull out plain arrays so the positional indices returned by
# split() can be used directly
X = df_small['reviewText'].values
y = df_small['overall'].values

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)

for train_index, test_index in splitter.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
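To illustrate why the F1 score suggested above exposes this failure mode while accuracy hides it, here is a minimal sketch (the `y_test` and `predicted` arrays are made up, not from the dataset):

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

# Hypothetical test labels and the output of a majority-class model
y_test = np.array([5, 5, 5, 4, 1, 3, 5, 2, 5, 5])
predicted = np.full_like(y_test, 5)

# Accuracy looks passable, but macro-F1 collapses because four
# of the five classes are never predicted
print(np.mean(predicted == y_test))                                   # 0.6
print(f1_score(y_test, predicted, average="macro", zero_division=0))  # 0.15
print(classification_report(y_test, predicted, zero_division=0))
```

The per-class rows of `classification_report` make the collapse obvious: every class except 5 has zero precision and recall.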