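Evaluating the prediction accuracy of an NB model

Asked: 2016-04-16 16:21:58

Tags: python-2.7 scikit-learn nltk

I am using nltk together with scikit-learn to check the accuracy of a Naive Bayes classifier. What am I doing wrong?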

...readFile definition not needed 
#divide the data into training and testing sets
data = readFile('Data_test/')
training_set = list_nltk[:2000000]
testing_set = list_nltk[2000000:]

#applied Bag of words as a way to select and extract feature
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training_set.split('\n'))

#apply tfd
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

#Train the data
clf = MultinomialNB().fit(X_train_tf, training_set.split('\n'))

#now test the accuracy of the naive bayes classifier
test_data_features = count_vect.transform(testing_set)
X_new_tfidf = tf_transformer.transform(test_data_features)

predicted = clf.predict(X_new_tfidf)
print "%.3f" % nltk.classify.accuracy(clf, predicted)

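The problem is that when I print nltk.classify.accuracy, it takes forever. I suspect that is because I am doing something wrong, but since I get no error I cannot figure out what it is.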

2 answers:

Answer 0 (score: 1)

Use accuracy_score from sklearn.metrics instead:

>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
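To make the 0.5 above less opaque, here is a minimal pure-Python sketch of what that number represents for these two lists: the fraction of positions where the prediction equals the true label. The names `matches` and `manual_accuracy` are illustrative only and are not part of scikit-learn.

```python
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]

# Count the positions where the predicted label equals the true label.
matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)

# Accuracy is the share of correct predictions among all samples.
manual_accuracy = matches / float(len(y_true))
print(manual_accuracy)
```

Only the first and last positions agree, so two of the four predictions are correct.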

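I think you are mixing up a few things about supervised learning.
See this answer and try to understand the top of this page.

Your data should be in this format (before vectorization):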

X = [["The cat is sleeping"], ..., ["The man is dead"]]
Y = [1, ..., 0] 
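As a rough end-to-end sketch of that idea, the snippet below reuses the question's steps (CountVectorizer, TfidfTransformer, MultinomialNB) with separate texts and labels, then scores the predictions with accuracy_score. The toy sentences, the labels, and the variable names `train_texts`, `train_labels`, `test_texts` and `test_labels` are assumptions made up for illustration. The texts are written as a flat list of strings because CountVectorizer takes an iterable of raw strings, and the print call uses Python 3 syntax.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: one raw string per document, one label per document.
train_texts = [
    "the cat is sleeping",
    "the dog is playing",
    "the man is dead",
    "the plant is dead",
]
train_labels = [1, 1, 0, 0]
test_texts = ["the cat is playing", "the dog is dead"]
test_labels = [1, 0]

# Fit the vocabulary and the term-frequency scaling on the training texts only.
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_texts)
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

# Train on (features, labels), not on (features, raw text).
clf = MultinomialNB().fit(X_train_tf, train_labels)

# Reuse the fitted transformers on the test texts, then predict.
X_test_tf = tf_transformer.transform(count_vect.transform(test_texts))
predicted = clf.predict(X_test_tf)

# Compare the predictions against the held-out true labels.
accuracy = accuracy_score(test_labels, predicted)
print("%.3f" % accuracy)
```

The point of the sketch is the shape of the calls: fit receives features plus labels, and accuracy_score receives true labels plus predictions.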

Answer 1 (score: 0)

At the very least, you have an error in this line:

clf = MultinomialNB().fit(X_train_tf, training_set.split('\n'))

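fit expects the vectorized data and the training labels there, but you are passing the vectorized data and the raw data.

It should look like this: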

clf = MultinomialNB().fit(X_train_tf, y_train)

But as far as I can tell, you do not have any label data (y_train) anywhere in your code.
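One possible way to get a y_train is to keep every text paired with its label and slice the pairs, so that the train/test split stays aligned. The question does not show where the labels would come from, so the (text, label) pair format, the toy data, and the names `labelled` and `split_at` below are all assumptions.

```python
# Hypothetical labelled data: each item is a (text, label) pair.
labelled = [
    ("the cat is sleeping", 1),
    ("the man is dead", 0),
    ("the dog is playing", 1),
    ("the plant is dead", 0),
    ("the bird is singing", 1),
]

# Stand-in for the 2000000 cut-off used in the question.
split_at = 3
train_pairs = labelled[:split_at]
test_pairs = labelled[split_at:]

# Unzip the pairs into parallel lists of texts and labels.
X_train_texts, y_train = map(list, zip(*train_pairs))
X_test_texts, y_test = map(list, zip(*test_pairs))

print(len(X_train_texts), len(y_train), len(X_test_texts), len(y_test))
```

Here y_train is what goes into fit as the second argument, and y_test is what the predictions are compared against afterwards.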