Reproducing the Bernoulli Naive Bayes classification example with nltk

Date: 2018-08-01 16:38:16

Tags: python nltk stanford-nlp naivebayes

I am trying to reproduce the basic example of Naive Bayes text classification with the Bernoulli classifier from "Introduction to Information Retrieval" (see here; for the setup and data, see here) in Python, using nltk. The word_feats(words) method is taken from here.

The code is as follows:

from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])

# define training and test sets as lists
trainyes = [['Chinese', 'Beijing', 'Chinese'],
            ['Chinese', 'Chinese', 'Shanghai'],
            ['Chinese', 'Macao']]
trainno = [['Tokyo', 'Japan', 'Chinese']]
test = ['Chinese', 'Chinese', 'Chinese', 'Tokyo', 'Japan']

# extract features from training and test sets
yesfeats = [(word_feats(wds), 'yes') for wds in trainyes]
nofeats = [(word_feats(wds), 'no') for wds in trainno]
trainfeats = nofeats + yesfeats
testfeats = word_feats(test)

# train the classifier
classifier = NaiveBayesClassifier.train(trainfeats)

# obtain predicted probabilities for "yes" and "no"
# for the test set
pctest = classifier.prob_classify(testfeats)
for label in pctest.samples():
    print("probability of %s: %f" % (label, pctest.prob(label)))

I have two questions about this:

(1) Is this a correct (if clumsy) way to train and classify with nltk? That is, given the input data and the targets, is the code right?

(2) The predicted probabilities returned by the classifier differ from those shown in the example. As pointed out here, this is because nltk does not implement Bernoulli Naive Bayes; rather, it "implements the multinomial Naive Bayes but only allows binary features". I would be grateful for any further clarification on this, i.e. what conditional-probability formula is used during nltk's Naive Bayes training and classification.
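
For comparison, here is my hand calculation of the Bernoulli model as the book presents it: conditional probabilities P(t|c) = (N_ct + 1) / (N_c + 2) with add-one smoothing, and every vocabulary term contributing to the score whether it occurs in the test document or not. This is only a sketch of the book's formula, not of what nltk does internally; it reuses the trainyes, trainno and test lists from above, and the helper names are my own.

# Hand-rolled Bernoulli NB estimates from the IIR example (not nltk code)
vocab = ['Chinese', 'Beijing', 'Shanghai', 'Macao', 'Tokyo', 'Japan']

def bernoulli_cond_probs(docs):
    # P(t|c): fraction of class documents containing t, with add-one smoothing
    n_docs = len(docs)
    return {t: (sum(t in d for d in docs) + 1) / (n_docs + 2) for t in vocab}

def bernoulli_score(prior, cond, doc):
    # unnormalised P(c|d): multiply P(t|c) if t is in d, else 1 - P(t|c)
    score = prior
    for t in vocab:
        score *= cond[t] if t in doc else (1 - cond[t])
    return score

pyes = bernoulli_cond_probs(trainyes)
pno = bernoulli_cond_probs(trainno)
print(bernoulli_score(3/4, pyes, test))   # ~0.005
print(bernoulli_score(1/4, pno, test))    # ~0.022, so the test doc is classified "no"

These are the numbers the book arrives at, and they do not match what nltk's classifier returns, hence my question about nltk's formula.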

0 Answers:

No answers yet.