I am trying to run multinomial Naive Bayes classification on a set of my tweets.
This is my code:
import codecs
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
trainfile = 'train.txt'
testfile = 'test.txt'
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8')) ## Error here
tags = ['Pro_vax','Anti_vax','Neither']
mnb = MultinomialNB()
mnb.fit(trainset, tags)
codecs.open(testfile,'r','utf8')
testset = word_vectorizer.transform(codecs.open(testfile,'r','utf8'))
results = mnb.predict(testset)
print results
The file train.txt contains the following text:
Vaccines are a very good idea. They prevent all sorts of deadly diseases.
Vaccines cause autism. Do not vaccinate your children
Going to read about vaccines. Then, I am going to see my brother with autism.
I labeled these lines using the tags variable.
The file test.txt contains the following text:
Do not get your kids vaccinated. Vaccination and autism are correlated.
When I run the script, I get the following error:
ValueError: Found arrays with inconsistent numbers of samples: [3 9]
I am not familiar with this error. What does it mean, and how can I prevent it from coming up again?
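
For context, the two numbers in the message are sample counts: one array has 3 samples and the other has 9. Since tags has 3 entries, the matrix returned by fit_transform presumably has 9 rows, because scikit-learn treats every item yielded by the iterable as one document and iterating over an open file yields one item per line. The sketch below only illustrates checking that count before training. It assumes the extra rows come from blank lines in the file, which nothing above confirms, and `load_documents` and the in-memory `sample` are hypothetical names used for illustration, not part of scikit-learn.

```python
import io


def load_documents(handle):
    """Return one stripped string per non-blank line of an open text file."""
    return [line.strip() for line in handle if line.strip()]


# Hypothetical stand-in for train.txt: three tweets separated by blank lines.
sample = io.StringIO(
    u"Vaccines are a very good idea.\n"
    u"\n"
    u"Vaccines cause autism.\n"
    u"\n"
    u"Going to read about vaccines.\n"
)
tags = ['Pro_vax', 'Anti_vax', 'Neither']

# Iterating the file directly yields every line, blank ones included.
raw_lines = list(sample)

# Rewind and keep only the lines that actually contain text.
sample.seek(0)
documents = load_documents(sample)

# fit() needs exactly one label per document, so check before training.
if len(documents) != len(tags):
    raise ValueError("%d documents but %d tags" % (len(documents), len(tags)))
```

If the counts match, `documents` can be passed to `word_vectorizer.fit_transform` in place of the open file handle. If the real file has 9 genuine documents, the fix is instead to supply 9 tags.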