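I am trying to use Naive Bayes to predict whether a book review is positive or negative (positive = more than 3 stars, negative = 3 stars or fewer).
I couldn't work out where the problem was, so I essentially copied this tutorial, but it still doesn't work: the model still predicts every review as positive. What am I missing?
Here is the input data: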
train_data[:2] -> [
str('julie strain fan collection photo page worth nice section painting look heavy literary content place find page text everything else line want one book six foot one probably good choice however like julie like like julie go wrong one either'),
str('care much seuss read philip nel book change mind good testimonial power rel write think rel play seuss ultimate compliment treat serious poet well one century interesting visual artist read book decide trip mandeville collection library university california san diego order could visit incredible holding almost much take like william butler yeats seuss lead career constantly shift metamoprhized meet new historical political cirsumstances seem leftist conservative different juncture career politics art nel show u cartoonist fabled pm magazine like andy warhol serve time slave ad business service amuse broaden mind u child nel hesitate administer sound spank seuss industry since death see fit license kind awful product include recent cat hat film mike myers oh book great especially recommend work picture editor give u bounty good illustration')
]
train_labels[:2] -> [1 1]
Here is my code (alternatively, you can find the data on my GitLab):
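import json
from pathlib import Path

import numpy as np

def read_data_from_file(readFileName):
    # The file holds a single JSON-encoded list
    with open(str(readFileName), "r", encoding='UTF-8') as f:
        readedList = json.loads(f.read())
    return readedList
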
DATASET_PATH = Path('data/Naive_Bayes_data')
SENTENCE_LIST = Path(r'NB_sentence_list.txt')
LABEL_LIST = Path(r'label_list.txt')

sentence_list = read_data_from_file(DATASET_PATH.joinpath(SENTENCE_LIST))
labels = read_data_from_file(DATASET_PATH.joinpath(LABEL_LIST))

# Join each review's tokens back into one space-separated string
data = [' '.join(sentence) for sentence in sentence_list]

# Binarize the star ratings:
# 0 - negative (<= 3 stars)
# 1 - positive (> 3 stars)
labels_np = np.asarray(labels, dtype=np.int64)
labels_np[labels_np <= 3] = 0
labels_np[labels_np > 3] = 1

# First 7000 reviews for training, the rest for testing
train_data = data[:7000]
train_labels = labels_np[:7000]
test_data = data[7000:]
test_labels = labels_np[7000:]

from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Bag-of-words counts, fitted on the training set only
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_data)

# Re-weight the counts with TF-IDF
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = MultinomialNB().fit(X_train_tfidf, train_labels)

# Apply the already-fitted transformers to the test set
X_test_counts = count_vect.transform(test_data)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predicted = clf.predict(X_test_tfidf)

print(metrics.classification_report(test_labels, predicted))
The classification report is:
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       610
          1       0.80      1.00      0.89      2391

avg / total       0.63      0.80      0.71      3001
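
As a sanity check, these numbers look like exactly what a classifier that labels every review as positive would produce. Below is a minimal sketch in plain Python that uses only the support counts from the report (610 negative, 2391 positive). The helper name `all_positive_report` is hypothetical and is not part of scikit-learn; the averages are assumed to be support-weighted, as in the "avg / total" row.

```python
def all_positive_report(n_negative, n_positive):
    """Metrics for a classifier that predicts class 1 for every sample."""
    total = n_negative + n_positive
    # Every sample is predicted positive, so true positives = n_positive
    # and false positives = n_negative.
    precision_pos = n_positive / total
    recall_pos = 1.0
    f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
    # Class 0 is never predicted, so its precision, recall and F1 are taken
    # as 0 and contribute nothing to the support-weighted averages.
    weight_pos = n_positive / total
    return {
        "precision_1": precision_pos,
        "recall_1": recall_pos,
        "f1_1": f1_pos,
        "avg_precision": weight_pos * precision_pos,
        "avg_recall": weight_pos * recall_pos,
        "avg_f1": weight_pos * f1_pos,
    }


# Support counts copied from the classification report
report = all_positive_report(n_negative=610, n_positive=2391)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, this gives 0.80 / 1.00 / 0.89 for class 1 and 0.63 / 0.80 / 0.71 for the weighted averages, which matches the report, so class 0 does not seem to be predicted even once on the test set.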
I don't know what I'm missing here...