NLTK Naive Bayes classification error

Time: 2015-09-22 23:23:36

Tags: python nltk naivebayes

Error message:

Traceback (most recent call last):
  File "/Users/ABHINAV/Documents/test2.py", line 58, in <module>
    classifier = NaiveBayesClassifier.train(trainfeats)
  File "/Library/Python/2.7/site-packages/nltk/classify/naivebayes.py", line 194, in train
    for featureset, label in labeled_featuresets:
ValueError: too many values to unpack
[Finished in 17.0s with exit code 1]

I get this error when I try to run Naive Bayes on a set of data. Here is the code:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4


trainfeats=[('good'),('pos'),
('quick'),('pos'),
('easy'),('pos'),
('big'),('pos'),
('iterested'),('pos'),
('important'),('pos'),
('new'),('pos'),
('patient'),('pos'),
('few'),('neg'),
('bad'),('neg'),

]

test=[
('general'),('pos'),
('many'),('pos'),
('efficient'),('pos'),
('great'),('pos'),
('interested'),('pos'),
('top'),('pos'),
('easy'),('pos'),
('big'),('pos'),
('new'),('pos'),
('wonderful'),('pos'),
('important'),('pos'),
('best'),('pos'),
('more'),('pos'),
('patient'),('pos'),
('last'),('pos'),
('worse'),('neg'),
('terrible'),('neg'),
('awful'),('neg'),
('bad'),('neg'),
('minimal'),('neg'),
('incomprehensible'),('neg'),
]

classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, test)
classifier.show_most_informative_features()

2 Answers:

Answer 0 (score: 2)

TLDR

You need this:

trainfeats=[('good','pos'),
('quick','pos'),
...

instead of this:

trainfeats=[('good'),('pos'),
('quick'),('pos'),
...

Explanation

The key error is ValueError: too many values to unpack, raised inside NaiveBayesClassifier.train, which you call on this line:

classifier = NaiveBayesClassifier.train(trainfeats)

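"Too many values to unpack" means the program expected a certain number of values from an iterable and received more than that. From your error message, this is the line that raised the error: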

for featureset, label in labeled_featuresets: 

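This for loop expects every item in labeled_featuresets to be a pair, and it assigns one member of that pair to featureset and the other to label. If labeled_featuresets actually contained triples, e.g. [(1,2,3),(1,2,3)...], the program would not know what to do with the third element, so it would raise this error.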
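The unpacking behaviour described above can be reproduced without NLTK. The following is a minimal sketch in plain Python; the helper name try_unpack is hypothetical and exists only for illustration. It runs the same kind of two-name loop over proper pairs, over triples, and over bare strings.

```python
def try_unpack(items):
    """Mimic `for featureset, label in items` and report what happens."""
    try:
        pairs = [(featureset, label) for featureset, label in items]
    except ValueError as err:
        return "ValueError: %s" % err
    return pairs

# Proper pairs unpack cleanly into two names.
ok = try_unpack([('good', 'pos'), ('bad', 'neg')])

# Triples carry one value too many for a two-name target.
triples = try_unpack([(1, 2, 3), (1, 2, 3)])

# A bare string is unpacked character by character,
# so 'good' offers four values where two are expected.
flat = try_unpack(['good', 'pos'])

print(ok)
print(triples)
print(flat)
```

The exact wording of the message differs slightly between Python versions, but both failing cases report too many values to unpack.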

Here is what you are passing into that function, which I believe ends up as labeled_featuresets:

trainfeats=[('good'),('pos'),
('quick'),('pos'),
('easy'),('pos'),
...

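It looks like you are trying to create a list of tuples by laying out the items in that list in pairs (which would have prevented the error you are getting), but those are not tuples. Python does not infer tuples from line layout or indentation, and parentheses around a single value do not create one either: ('good') is just the string 'good'. The list is therefore a flat sequence of strings, and unpacking a string such as 'good' yields one value per character. I think this is what you were aiming for: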

trainfeats=[('good','pos'),
('quick','pos'),
('easy','pos'),
...

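Wrapping each pair in a single set of parentheses, with a comma between the two items, creates a list of tuples rather than a list of single elements.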
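To check the difference between the two layouts, here is a small sketch in plain Python. The variable names flat, pairs, and repaired are illustrative only, and the zip-based repair is just one possible way to pair up a list that has already been written flat.

```python
flat = [('good'), ('pos'),
        ('quick'), ('pos'),
        ('few'), ('neg')]

# Parentheses around a single value are only grouping: each item is a plain str.
print(type(flat[0]))
print(len(flat))  # six separate strings, not three pairs

# The comma inside the parentheses is what builds a tuple.
pairs = [('good', 'pos'), ('quick', 'pos'), ('few', 'neg')]

# Hypothetical repair of an already-flat list:
# pair even-indexed words with odd-indexed labels.
repaired = list(zip(flat[0::2], flat[1::2]))
print(repaired == pairs)
```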

Answer 1 (score: 0)

The trainfeats variable should be:

trainfeats = [
    ({'good': True, 'quick': True, 'easy': True,
      'big': True, 'interested': True, 'important': True,
      'new': True, 'patient': True}, 'pos'),
    ({'few': True, 'bad': True}, 'neg'),
]

This is the correct format for labeled feature sets in nltk.

Similarly, the test variable should be:

test=[({'general':True,'many':True,'efficient':True,'great':True,'interested':True,'top':True,'easy':True,'big':True,'new':True,'wonderful':True,'important':True,'best':True,'more':True,'patient':True,'last':True},'pos'),({'worse':True,'terrible':True,'awful':True,'bad':True,'minimal':True,'incomprehensible':True},'neg')]
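Writing these dictionaries by hand is error-prone, so the structure used in this answer can also be built from (word, label) pairs. The sketch below is one possible way to do that in plain Python; the helper name to_labeled_featuresets is hypothetical and not part of nltk, and grouping all words of a label into one feature set simply mirrors the lists shown in this answer.

```python
from collections import OrderedDict

def to_labeled_featuresets(word_label_pairs):
    """Group words by label into ({word: True, ...}, label) tuples."""
    grouped = OrderedDict()  # keeps labels in first-seen order
    for word, label in word_label_pairs:
        grouped.setdefault(label, {})[word] = True
    return [(features, label) for label, features in grouped.items()]

pairs = [('good', 'pos'), ('quick', 'pos'), ('few', 'neg'), ('bad', 'neg')]
trainfeats = to_labeled_featuresets(pairs)
print(trainfeats)

# Assumption: with nltk installed, a list in this shape can be passed to
# NaiveBayesClassifier.train(trainfeats), as in the question's code.
```

Whether to keep one feature set per label, as here, or to emit one feature set per word is a modelling choice that this thread does not discuss.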