Text classification using Naive Bayes in Python

Time: 2017-09-25 12:28:37

Tags: python machine-learning naivebayes

I created a model in which I run Naive Bayes to get the expected output.

from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob
training_corpus = [
    ('Agree Completely Agree Strongly Agree Somewhat Disagree Somewhat Disagree Strongly Completely Disagree','TRUE'),
    ('Concerned 2 3 4 5 6 7 - Comfortable','TRUE'),
    ('1 - disagree strongly 2 - disagree somewhat 3 - neither agree nor disagree 4 - agree somewhat 5 - agree strongly','TRUE'),
    ('1 - doesn\'t apply at all 2 3 4 5 6 7 - applies completely','TRUE'),
    ('1 - extremely new and different 2 3 4 5 6 7 - not at all new & different','TRUE'),
    ('1 - extremely relevant 2 3 4 5 6 7 - not at all relevant','TRUE'),
    ('1 - I don\'t want brands to engage with me at all on social media 2 3 4 5 6 7 - I love to engage with brands on social media','TRUE'),
    ('1 - Most Important 2 3 4 5 - Least Important','TRUE'),
    ('pepsi','FALSE'),
    ('coca cola','FALSE'),
    ('hyundai','FALSE'),
    ('Audio quality','FALSE'),
    ('Product features ','FALSE'),
    ('Content ','FALSE')
]
test_corpus = [
    ('1 - Agree Completely 2 - Agree Strongly 3 - Agree Somewhat 4 - Disagree Somewhat 5 - Disagree Strongly 6 - Completely Disagree','TRUE'),
    ('1 - Concerned 2 3 4 5 6 7 - Comfortable','TRUE'),
    ('Content ','FALSE'),
    ('Ease of navigation','FALSE')
]
model = NBC(training_corpus) 
print(model.classify('pepsi'))
print(model.accuracy(test_corpus)*100)

When I run this code it reports 100% accuracy, but it returns FALSE every time. I'm not sure what the problem is, but this is not the expected output.

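1 answer:

Answer 0: (score: 0)

Your model is fine; the behavior comes from your data and the classifier. In other words, given the training data you supplied, it works well. Let's run a few tests: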

def test(s):
    # Print the per-label probabilities the trained model assigns to s
    prob_dist = model.prob_classify(s)
    print("classifying", s)
    print("probability of being FALSE:", round(prob_dist.prob("FALSE"), 2),
          "probability of being TRUE:", round(prob_dist.prob("TRUE"), 2))
    print('-'*70)

test_cases = ['1', '1 - ', '2', '2 3 4 5', '1- 2 3 4 5', 'pepsi', 'coca', 'BMW']
for tc in test_cases:
    test(tc)

Here is the output, which looks quite reasonable:

classifying 1
probability of being FALSE: 1.0 probability of being TRUE: 0.0
----------------------------------------------------------------------
classifying 1 - 
probability of being FALSE: 1.0 probability of being TRUE: 0.0
----------------------------------------------------------------------
classifying 2
probability of being FALSE: 1.0 probability of being TRUE: 0.0
----------------------------------------------------------------------
classifying 2 3 4 5
probability of being FALSE: 0.05 probability of being TRUE: 0.95
----------------------------------------------------------------------
classifying 1- 2 3 4 5
probability of being FALSE: 0.0 probability of being TRUE: 1.0
----------------------------------------------------------------------
classifying pepsi
probability of being FALSE: 1.0 probability of being TRUE: 0.0
----------------------------------------------------------------------
classifying coca
probability of being FALSE: 1.0 probability of being TRUE: 0.0
----------------------------------------------------------------------
classifying BMW
probability of being FALSE: 1.0 probability of being TRUE: 0.0
----------------------------------------------------------------------

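OK, so now you may be wondering why the classifier behaves this way. Look at your code: where did you specify a feature extractor? Nowhere, so it falls back on the default function to extract the feature vectors, as explained in the documentation. (You can also take a look at the source code.)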
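
To make that default behavior concrete, here is a minimal sketch of a "contains(word)" extractor in plain Python. It is not TextBlob's actual implementation: the function name `contains_features`, the whitespace tokenization, and the three-example corpus are assumptions chosen only for illustration.

```python
def contains_features(text, vocabulary):
    """Map a text to one boolean 'contains(word)' feature per vocabulary word."""
    tokens = set(text.split())  # assumption: naive whitespace tokenization
    return {f"contains({word})": (word in tokens) for word in vocabulary}


# Tiny illustrative corpus (a subset of the question's training data)
train = [
    ("1 - Most Important 2 3 4 5 - Least Important", "TRUE"),
    ("pepsi", "FALSE"),
    ("coca cola", "FALSE"),
]

# The vocabulary is every distinct token seen in the training texts
vocabulary = sorted({word for text, _ in train for word in text.split()})

features = contains_features("1", vocabulary)
present = [name for name, value in features.items() if value]
absent = [name for name, value in features.items() if not value]

print(present)      # ['contains(1)']
print(len(absent))  # 11
```

With this kind of representation, a short input such as "1" switches on a single feature and leaves every other one False, so the absent words contribute to the decision as well.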

For example, you can inspect the model's most informative features:

model.show_informative_features()


>>> Most Informative Features
             contains(4) = False           FALSE : TRUE   =      5.6 : 1.0
             contains(3) = False           FALSE : TRUE   =      5.6 : 1.0
             contains(5) = False           FALSE : TRUE   =      5.6 : 1.0
             contains(2) = False           FALSE : TRUE   =      5.6 : 1.0
             contains(1) = False           FALSE : TRUE   =      3.3 : 1.0
             contains(7) = False           FALSE : TRUE   =      2.4 : 1.0
             contains(6) = False           FALSE : TRUE   =      2.4 : 1.0
            contains(at) = False           FALSE : TRUE   =      1.9 : 1.0
           contains(all) = False           FALSE : TRUE   =      1.9 : 1.0
           contains(not) = False           FALSE : TRUE   =      1.3 : 1.0
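
The ratios in this table can be reproduced by hand. The following is a minimal sketch, assuming the underlying NLTK classifier estimates each feature probability with expected-likelihood (add 0.5) smoothing over the two values True and False. The counts are read off the training corpus in the question: 6 FALSE examples, none of which contains a digit, and 8 TRUE examples, of which 1 lacks "4", 2 lack "1", and 3 lack "7". The helper names `ele_prob` and `false_to_true_ratio` are made up for this illustration.

```python
def ele_prob(count, total, bins=2, gamma=0.5):
    """Expected-likelihood estimate: add gamma to each of the `bins` outcomes."""
    return (count + gamma) / (total + bins * gamma)


def false_to_true_ratio(false_without, n_false, true_without, n_true):
    """Ratio P(contains(w)=False | FALSE) / P(contains(w)=False | TRUE)."""
    return ele_prob(false_without, n_false) / ele_prob(true_without, n_true)


N_FALSE, N_TRUE = 6, 8  # labelled examples per class in training_corpus

# Number of TRUE examples that do NOT contain each token (counted by hand)
true_without = {"4": 1, "1": 2, "7": 3}

for token, k in true_without.items():
    # No FALSE example contains a digit, so all 6 of them lack the token
    ratio = false_to_true_ratio(N_FALSE, N_FALSE, k, N_TRUE)
    print(f"contains({token}) = False   FALSE : TRUE = {ratio:.1f} : 1.0")
```

Under these assumptions the ratios come out as 5.6, 3.3 and 2.4, matching the table. The absence of each scale digit counts as evidence for FALSE, which suggests why very short inputs such as '1' or 'BMW' end up labelled FALSE.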