Question

Checking the accuracy of a Naive Bayes classifier with nltk in scikit-learn: what am I doing wrong?

...readFile definition not needed 
#divide the data into training and testing sets
data = readFile('Data_test/')
training_set = list_nltk[:2000000]
testing_set = list_nltk[2000000:]

#applied Bag of words as a way to select and extract feature
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training_set.split('\n'))

#apply tfd
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

#Train the data
clf = MultinomialNB().fit(X_train_tf, training_set.split('\n'))

#now test the accuracy of the naive bayes classifier
test_data_features = count_vect.transform(testing_set)
X_new_tfidf = tf_transformer.transform(test_data_features)

predicted = clf.predict(X_new_tfidf)
print "%.3f" % nltk.classify.accuracy(clf, predicted)

The problem is that when I print nltk.classify.accuracy, it takes forever. I suspect I am doing something wrong, but since I get no error I can't figure out what it is.

Answer 1

Use accuracy_score from sklearn.metrics instead.

>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5

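I think you are mixing up a few things about supervised learning.
See this answer and try to understand the top of this page.

Your data should be in this format (before vectorization):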

X = [["The cat is sleeping"], ..., ["The man is dead"]]
Y = [1, ..., 0]
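
As a minimal sketch of that layout, here is a toy example with sentences and labels invented for illustration. One detail differs from the shape above: as far as I know, CountVectorizer expects a flat iterable of strings, so the sketch uses a plain list of str instead of nested lists.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: one document per entry, one label per document.
X = ["The cat is sleeping", "The dog is barking", "The man is dead"]
Y = [1, 1, 0]

# Vectorize the raw documents; the labels stay in their own list.
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(X)

# One row per document, one column per vocabulary term.
n_docs, n_terms = X_counts.shape
print(n_docs, len(Y))
```

The point is that X and Y are separate, parallel sequences: row i of the vectorized matrix corresponds to label Y[i].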

Answer 2

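You have an error in at least this line:

clf = MultinomialNB().fit(X_train_tf, training_set.split('\n'))

You need the vectorized data and the training labels there, but you are passing the vectorized data and the raw data.

It should look like this:

clf = MultinomialNB().fit(X_train_tf, y_train)

But as far as I can see, you don't have any label data y_train anywhere in your code.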
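
To make that concrete, here is a rough end-to-end sketch in Python 3. The texts and the names train_texts, test_texts and y_test are made up for illustration, since the question does not show any labels. It assumes the labels are available as a list parallel to the documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled data: texts and labels are invented for illustration.
train_texts = ["good movie", "great film", "bad movie", "awful film"]
y_train = [1, 1, 0, 0]
test_texts = ["great movie", "awful movie"]
y_test = [1, 0]

# Fit the vectorizer and the tf transformer on the training texts only.
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_texts)
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

# The second argument is the labels, not the raw text.
clf = MultinomialNB().fit(X_train_tf, y_train)

# Reuse the fitted transformers on the test texts, then compare the
# predictions with the true test labels.
X_test_tf = tf_transformer.transform(count_vect.transform(test_texts))
predicted = clf.predict(X_test_tf)
accuracy = accuracy_score(y_test, predicted)
print("%.3f" % accuracy)
```

Note that accuracy_score compares true labels with predicted labels, so the test set needs labels as well.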
