I am trying to reproduce the basic worked example of NB text classification from "Introduction to Information Retrieval" (see here; for the setup and data, see here) with a Bernoulli classifier in Python, using nltk. The word_feats(words) method is taken from here.
The code is as follows:
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])

# define training and test sets as lists
trainyes = [['Chinese', 'Beijing', 'Chinese'],
            ['Chinese', 'Chinese', 'Shanghai'],
            ['Chinese', 'Macao']]
trainno = [['Tokyo', 'Japan', 'Chinese']]
test = ['Chinese', 'Chinese', 'Chinese', 'Tokyo', 'Japan']

# extract features from training and test sets
yesfeats = [(word_feats(wds), 'yes') for wds in trainyes]
nofeats = [(word_feats(wds), 'no') for wds in trainno]
trainfeats = nofeats + yesfeats
testfeats = word_feats(test)

# train the classifier
classifier = NaiveBayesClassifier.train(trainfeats)

# obtain predicted probabilities for "yes" and "no" on the test set
pctest = classifier.prob_classify(testfeats)
for label in pctest.samples():
    print("probability of %s: %f" % (label, pctest.prob(label)))
Two questions about this:
(1) Is this a correct (if clumsy) way to train and classify with nltk? That is, given the input data and the targets, is the code right?
(2) The predicted probabilities returned by the classifier differ from those shown in the example. As pointed out here, that is because nltk does not implement Bernoulli Naive Bayes; rather, it "implements multinomial Naive Bayes but only allows binary features." I would appreciate any further clarification on this point, i.e. what conditional probability formula nltk uses during Naive Bayes training and classification.
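For what it's worth, question (2) can be checked by hand: the textbook's worked example gives both a Bernoulli and a multinomial estimate for this data, and the disagreement between the two models is the point. The sketch below is a plain hand computation with add-one smoothing following the book's formulas (it is not nltk's internal code, which additionally uses its own smoothing, ELE by default if I read the source correctly):

```python
from math import prod

# Training and test data from the textbook example
trainyes = [['Chinese', 'Beijing', 'Chinese'],
            ['Chinese', 'Chinese', 'Shanghai'],
            ['Chinese', 'Macao']]
trainno = [['Tokyo', 'Japan', 'Chinese']]
test = ['Chinese', 'Chinese', 'Chinese', 'Tokyo', 'Japan']

vocab = {w for doc in trainyes + trainno for w in doc}
n = len(trainyes) + len(trainno)

def bernoulli_score(docs, n_total, test_doc):
    # Bernoulli NB: per-term *document* frequencies with add-one smoothing;
    # terms absent from the test doc contribute a factor (1 - P(t|c)).
    prior = len(docs) / n_total
    p = {t: (sum(t in d for d in docs) + 1) / (len(docs) + 2) for t in vocab}
    score = prior
    for t in vocab:
        score *= p[t] if t in test_doc else (1 - p[t])
    return score

yes_b = bernoulli_score(trainyes, n, test)
no_b = bernoulli_score(trainno, n, test)
print('Bernoulli:   yes=%.6f  no=%.6f' % (yes_b, no_b))
# textbook values: ~0.005 vs ~0.022, so Bernoulli picks "no"

def multinomial_score(docs, n_total, test_doc):
    # Multinomial NB: *token* counts with add-one (Laplace) smoothing;
    # each token of the test doc contributes one factor P(t|c).
    prior = len(docs) / n_total
    tokens = [w for d in docs for w in d]
    p = {t: (tokens.count(t) + 1) / (len(tokens) + len(vocab)) for t in vocab}
    return prior * prod(p[t] for t in test_doc)

yes_m = multinomial_score(trainyes, n, test)
no_m = multinomial_score(trainno, n, test)
print('Multinomial: yes=%.6f  no=%.6f' % (yes_m, no_m))
# textbook values: ~0.0003 vs ~0.0001, so multinomial picks "yes"
```

The Bernoulli model favours "no" while the multinomial model favours "yes", which is exactly the contrast the book discusses; since nltk's model is closer to a multinomial one restricted to binary features, and uses different smoothing, its numbers will match neither column exactly.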