Checking the accuracy of a Naive Bayes classifier using nltk with scikit-learn: what am I doing wrong?
...readFile definition not needed
#divide the data into training and testing sets
data = readFile('Data_test/')
training_set = list_nltk[:2000000]
testing_set = list_nltk[2000000:]
#applied Bag of words as a way to select and extract feature
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training_set.split('\n'))
#apply term-frequency (tf) weighting
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
#Train the data
clf = MultinomialNB().fit(X_train_tf, training_set.split('\n'))
#now test the accuracy of the naive bayes classifier
test_data_features = count_vect.transform(testing_set)
X_new_tfidf = tf_transformer.transform(test_data_features)
predicted = clf.predict(X_new_tfidf)
print "%.3f" % nltk.classify.accuracy(clf, predicted)
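The problem is that when I print nltk.classify.accuracy, it takes forever. I suspect I am doing something wrong, but since I get no error, I cannot figure out what it is.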
Answer 0 (score: 1)
Use accuracy_score from sklearn.metrics instead.
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
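To make the number above concrete, here is a minimal pure-Python sketch of what classification accuracy means: the fraction of positions where the predicted label equals the true label. The helper name `accuracy` is hypothetical and is not how scikit-learn implements it internally; it only mirrors the default behaviour for this simple case.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty input")
    # Count the positions where both sequences agree.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    # float() keeps the division exact on both Python 2 and Python 3.
    return correct / float(len(y_true))


# Same inputs as the accuracy_score example: two of four labels match.
result = accuracy([0, 1, 2, 3], [0, 2, 1, 3])
print(result)
```

Note that both arguments are plain label sequences. Neither of them is a classifier object, which is one way the call in the question differs.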
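I think you are mixing up a few things about supervised learning.
See this answer and try to understand the top of this page.
Your data should be in this format (before vectorization):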
X = ["The cat is sleeping", ..., "The man is dead"]
Y = [1, ..., 0]
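Putting the pieces together, the following is a rough sketch of the whole workflow on labeled data, reusing the same steps as the question (CountVectorizer, TfidfTransformer with use_idf=False, MultinomialNB) and scoring with accuracy_score. The toy sentences and labels are made up for illustration. CountVectorizer expects an iterable of raw strings, so each document is passed as a single string.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: one raw string and one label per document.
X_train = [
    "the cat is sleeping",
    "the cat is purring",
    "the man is dead",
    "the man is gone",
]
y_train = [1, 1, 0, 0]
X_test = ["a cat is purring", "a man is gone"]
y_test = [1, 0]

# Bag of words: learn the vocabulary on the training texts only.
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)

# Term-frequency weighting, as in the question (no idf).
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

# Train on (features, labels), not on (features, raw text).
clf = MultinomialNB().fit(X_train_tf, y_train)

# Apply the already fitted transformers to the test texts.
X_test_tf = tf_transformer.transform(count_vect.transform(X_test))
predicted = clf.predict(X_test_tf)

# Compare the predicted labels with the true test labels.
accuracy = accuracy_score(y_test, predicted)
print("%.3f" % accuracy)
```

The key point is that the test labels (`y_test` here) are needed to measure accuracy; the predictions alone are not enough.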
Answer 1 (score: 0)
At the very least, you have an error in this line:
clf = MultinomialNB().fit(X_train_tf, training_set.split('\n'))
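fit() needs the vectorized data and the training labels there, but you are passing the vectorized data and the raw data.
It should look like this: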
clf = MultinomialNB().fit(X_train_tf, y_train)
But as far as I can see, you do not have any label data (y_train) anywhere in your code.
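One possible way to obtain y_train is to keep each text together with its label when splitting, so that texts and labels stay aligned. This is only a sketch: the `labeled` list and the split index are hypothetical stand-ins for whatever readFile returns and for the 2000000 cut-off in the question.

```python
# Hypothetical labeled corpus: (text, label) pairs.
labeled = [
    ("the cat is sleeping", 1),
    ("the cat is purring", 1),
    ("the man is dead", 0),
    ("the man is gone", 0),
    ("a cat is purring", 1),
    ("a man is gone", 0),
]

# Slice the pairs once, so each text keeps its own label.
split = 4
train_pairs = labeled[:split]
test_pairs = labeled[split:]

# Separate texts from labels for scikit-learn.
X_train = [text for text, _ in train_pairs]
y_train = [label for _, label in train_pairs]
X_test = [text for text, _ in test_pairs]
y_test = [label for _, label in test_pairs]

print(len(X_train), len(y_train), len(X_test), len(y_test))
```

X_train would then go through the vectorizer, y_train into fit(), and y_test into the accuracy computation.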