NLTK NaiveBayesClassifier for text classification

Date: 2016-09-06 14:38:23

Tags: machine-learning nlp nltk text-classification document-classification

In the code below, I know my Naive Bayes classifier works correctly because it behaves as expected on trainset1, so why does it not work on trainset2? I even tried two classifiers, one from TextBlob and one from NLTK.

"Total Amount"

Code:

from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob
from nltk.tokenize import word_tokenize
import nltk

trainset1 = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

trainset2 = [('hide all brazil and everything plan limps to anniversary inflation plan initiallyis limping its first anniversary amid soaring prices', 'class1'),
         ('hello i was there and no one came', 'class2'),
         ('all negative terms like sad angry etc', 'class2')]

def nltk_naivebayes(trainset, test_sentence):
    # Vocabulary: every token from the training sentences, lowercased.
    all_words = set(word.lower() for passage in trainset for word in word_tokenize(passage[0]))
    # One boolean "word present" feature per vocabulary word for each
    # training sentence (note: training tokens keep their original case
    # here, so capitalized words never match the lowercased vocabulary).
    t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in trainset]
    classifier = nltk.NaiveBayesClassifier.train(t)
    # Extract the same features from the test sentence, then classify it.
    test_sent_features = {word.lower(): (word in word_tokenize(test_sentence.lower())) for word in all_words}
    return classifier.classify(test_sent_features)

def textblob_naivebayes(trainset, test_sentence):
    # TextBlob does its own feature extraction: train on the raw
    # labeled sentences and classify the test sentence directly.
    cl = NaiveBayesClassifier(trainset)
    blob = TextBlob(test_sentence, classifier=cl)
    return blob.classify()

test_sentence1 = "he is my horrible enemy"
test_sentence2 = "inflation soaring limps to anniversary"

print(nltk_naivebayes(trainset1, test_sentence1))
print(nltk_naivebayes(trainset2, test_sentence2))
print(textblob_naivebayes(trainset1, test_sentence1))
print(textblob_naivebayes(trainset2, test_sentence2))

Both classifiers return class2 for test_sentence2, although test_sentence2 obviously belongs to class1.

1 Answer:

Answer 0 (score: 4):

I will assume you understand that you cannot expect a classifier to learn a good model from only 3 examples, and that your question is more about understanding why it fails in this specific case.

The likely reason is that the Naive Bayes classifier uses a prior class probability, that is, the probability of each class regardless of the text. In your case, 2 of the 3 examples are negative (class2), so the prior is 66% for class2 and 33% for class1. The positive (class1) words in your single class1 instance are "anniversary" and "soaring", which is unlikely to be enough to compensate for this prior class probability.
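To see this prior at work, you can ask NLTK for the full label distribution instead of only the winning label. Below is a minimal sketch (the helper name label_distribution is mine) that reuses the same bag-of-words features as your nltk_naivebayes function; the exact numbers depend on NLTK's internal smoothing, so treat them as illustrative:

import nltk
from nltk.tokenize import word_tokenize

def label_distribution(trainset, test_sentence):
    # Same feature extraction as in the question: one boolean
    # "word present" feature per vocabulary word.
    all_words = set(w.lower() for sent, _ in trainset for w in word_tokenize(sent))
    train_feats = [({w: (w in word_tokenize(sent)) for w in all_words}, label)
                   for sent, label in trainset]
    classifier = nltk.NaiveBayesClassifier.train(train_feats)
    test_feats = {w: (w in word_tokenize(test_sentence.lower())) for w in all_words}
    # prob_classify returns the posterior over all labels, so you can
    # see how close class1 comes to overcoming the class2 prior.
    dist = classifier.prob_classify(test_feats)
    return {label: dist.prob(label) for label in dist.samples()}

print(label_distribution(trainset2, "inflation soaring limps to anniversary"))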

In particular, note that the calculation of word probabilities involves various "smoothing" functions (for instance, log10(Term Frequency + 1) in each class rather than log10(Term Frequency)) to prevent low-frequency words from influencing the classification too strongly, to avoid division by zero, and so on. Thus the probabilities of "anniversary" and "soaring" are not 0.0 for class2 and 1.0 for class1, contrary to what you may have expected.
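As a toy illustration of why a never-seen word does not get probability 0.0, here is an add-one (Laplace) smoothing sketch. Note this is not NLTK's exact formula; by default nltk.NaiveBayesClassifier.train uses expected-likelihood estimation (ELEProbDist, which adds 0.5 to each count rather than 1), but the effect is the same in spirit. The counts and vocabulary size below are made up for illustration:

def smoothed_word_prob(count_in_class, total_words_in_class, vocab_size):
    # Add-one (Laplace) smoothing: every word gets a pseudo-count of 1,
    # so unseen words keep a small but non-zero probability.
    return (count_in_class + 1) / (total_words_in_class + vocab_size)

# Suppose "soaring" appears once in class1 and never in class2:
print(smoothed_word_prob(1, 20, 30))  # class1: 2/50 = 0.04
print(smoothed_word_prob(0, 10, 30))  # class2: 1/40 = 0.025, not 0.0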