I am trying to reproduce the basic worked example of NB text classification from "Introduction to Information Retrieval" (see here; for the setup and data, see here) with a Bernoulli classifier in Python, using nltk. The word_feats(words) method is taken from here.
The code is as follows:
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])

# define training and test sets as lists
trainyes = [['Chinese', 'Beijing', 'Chinese'],
            ['Chinese', 'Chinese', 'Shanghai'],
            ['Chinese', 'Macao']]
trainno = [['Tokyo', 'Japan', 'Chinese']]
test = ['Chinese', 'Chinese', 'Chinese', 'Tokyo', 'Japan']

# extract features from training and test sets
yesfeats = [(word_feats(wds), 'yes') for wds in trainyes]
nofeats = [(word_feats(wds), 'no') for wds in trainno]
trainfeats = nofeats + yesfeats
testfeats = word_feats(test)

# train the classifier
classifier = NaiveBayesClassifier.train(trainfeats)

# obtain predicted probabilities for "yes" and "no" on the test set
pctest = classifier.prob_classify(testfeats)
for label in pctest.samples():
    print("probability of %s: %f" % (label, pctest.prob(label)))
Two questions about this:
(1) Is this a correct (if clumsy) way to train and classify with nltk? That is, given the input data and the targets, is the code right?
(2) The predicted probabilities returned by the classifier differ from those shown in the example. As pointed out here, that is because nltk does not implement Bernoulli Naive Bayes; rather, it "implements multinomial Naive Bayes but only allows binary features." I would appreciate any further clarification on this point, i.e. what conditional probability formula nltk uses during Naive Bayes training and classification.
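For what it's worth, question (2) can be checked by hand: the textbook's worked example gives both a Bernoulli and a multinomial estimate for this data, and the disagreement between the two models is the point. The sketch below is a plain hand computation with add-one smoothing following the book's formulas (it is not nltk's internal code, which additionally uses its own smoothing, ELE by default if I read the source correctly):

```python
from math import prod

# Training and test data from the textbook example
trainyes = [['Chinese', 'Beijing', 'Chinese'],
            ['Chinese', 'Chinese', 'Shanghai'],
            ['Chinese', 'Macao']]
trainno = [['Tokyo', 'Japan', 'Chinese']]
test = ['Chinese', 'Chinese', 'Chinese', 'Tokyo', 'Japan']

vocab = {w for doc in trainyes + trainno for w in doc}
n = len(trainyes) + len(trainno)

def bernoulli_score(docs, n_total, test_doc):
    # Bernoulli NB: per-term *document* frequencies with add-one smoothing;
    # terms absent from the test doc contribute a factor (1 - P(t|c)).
    prior = len(docs) / n_total
    p = {t: (sum(t in d for d in docs) + 1) / (len(docs) + 2) for t in vocab}
    score = prior
    for t in vocab:
        score *= p[t] if t in test_doc else (1 - p[t])
    return score

yes_b = bernoulli_score(trainyes, n, test)
no_b = bernoulli_score(trainno, n, test)
print('Bernoulli:   yes=%.6f  no=%.6f' % (yes_b, no_b))
# textbook values: ~0.005 vs ~0.022, so Bernoulli picks "no"

def multinomial_score(docs, n_total, test_doc):
    # Multinomial NB: *token* counts with add-one (Laplace) smoothing;
    # each token of the test doc contributes one factor P(t|c).
    prior = len(docs) / n_total
    tokens = [w for d in docs for w in d]
    p = {t: (tokens.count(t) + 1) / (len(tokens) + len(vocab)) for t in vocab}
    return prior * prod(p[t] for t in test_doc)

yes_m = multinomial_score(trainyes, n, test)
no_m = multinomial_score(trainno, n, test)
print('Multinomial: yes=%.6f  no=%.6f' % (yes_m, no_m))
# textbook values: ~0.0003 vs ~0.0001, so multinomial picks "yes"
```

The Bernoulli model favours "no" while the multinomial model favours "yes", which is exactly the contrast the book discusses; since nltk's model is closer to a multinomial one restricted to binary features, and uses different smoothing, its numbers will match neither column exactly.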