Python NLTK Naive Bayes classifier

Date: 2016-05-30 17:21:32

Tags: python-3.x nltk corpus

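I am trying to implement the NLTK Naive Bayes classifier on a dataset that has positive and negative categories, using a feature-extraction function, features_all(). When I run the code, I get an error on one line of the features_all() function.

Naive Bayes code: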

import nltk
import random
from nltk.corpus import stopwords
import nltk.classify.util
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
import re

from feature_extractors import features_all #function for features extraction

path = "/.../all kom"

reader = CategorizedPlaintextCorpusReader(path,r'.*\.txt',cat_pattern=r'(^\w..)/*')

po=reader.sents(categories=['pos']) #tokenize 
ne=reader.sents(categories=['neg'])

labeled_sentiments = ([(n, 'positive') for n in po] + [(n, 'negative') for n in ne])

size = int(len(labeled_sentiments) * 0.9) #for separating training set in 90:10
random.shuffle(labeled_sentiments)

featuresets = [(features_all(n), sentiment) for (n, sentiment) in labeled_sentiments]
train_set = featuresets[:size]
test_set = featuresets[size:]

#Naive Bayes
classifier = nltk.NaiveBayesClassifier.train(train_set)
#test
print(classifier.classify(features_all('great')))
print(classifier.classify(features_all('bad')))
print('Accuracy for Naive Bayes: ', nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(15) #prints the table itself and returns None

The features_all() function:

from ast import literal_eval #needed for literal_eval() below

def features_all(dat):

    f_all_dict=open('all_dict.txt','r',encoding='utf-8').read()

    f = literal_eval(f_all_dict)

    result_all = {} 

    for word in f.items():
        result_all = {"{}_{}".format(word, suffix): pol * dat.count(word) for word, (suffix, pol) in f.items()} #here is where I get the error

    if len(f) == len(result_all):
       return result_all
    else:
       return None

features_all() produces output like this (example):

great_pos:1, bad_neg:1

all_dict.txt looks like this:

"great":("pos",2),"bad":("neg",2)

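I get the error on the line result_all = {"{}_{}".format(word, suffix): pol * dat.count(word) for word, (suffix, pol) in f.items()}

I don't know exactly what the error is, because the code never finishes executing when I run it. I interrupted it, and that is the line where it stopped, so I'm fairly sure the problem is there. I can't tell whether the problem lies in the formatting or in the function's input. I would appreciate any help.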

1 answer:

Answer 0 (score: 2)

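I'm pretty sure you just need to include result_all in the formatted return statement, "{}_{}:{}".format(word, suffix, pol * dat.count(word)) for word, (suffix, pol) in f.items(). A very simple way to check whether your code works is to check that you always get output in the format you expect. If you just print("{}_{}".format(word, suffix): pol * dat.count(word) for word, (suffix, pol) in f.items()), you get an invalid syntax error, because a key: value pair is only valid inside the braces of a dict comprehension. If you're unsure about your code, keep the print statements in!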
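
The post shows no traceback, so the following is only a guess at why the script appears to hang: features_all() re-reads and re-parses all_dict.txt for every sentence, and the for loop rebuilds the entire dictionary once per lexicon entry, which becomes very slow with a large lexicon and corpus. Below is a minimal sketch of one way to restructure it, not a confirmed fix. The helper load_lexicon() and the extra lexicon parameter are hypothetical names introduced here, and the sketch assumes each sentence is passed as a list of tokens (as reader.sents() yields them).

```python
from ast import literal_eval
from collections import Counter


def load_lexicon(path="all_dict.txt"):
    """Read and parse the lexicon file once (hypothetical helper).

    Assumes the file holds entries such as "great":("pos",2),"bad":("neg",2).
    literal_eval needs a complete dict literal, so braces are added if missing.
    """
    with open(path, encoding="utf-8") as fh:
        text = fh.read().strip()
    if not text.startswith("{"):
        text = "{" + text + "}"
    return literal_eval(text)


def features_all(tokens, lexicon):
    """Return {word_suffix: polarity * occurrences} for every lexicon word.

    tokens is assumed to be a list of word tokens, not a raw string.
    """
    counts = Counter(tokens)  # count each token once instead of scanning per word
    return {
        "{}_{}".format(word, suffix): pol * counts[word]
        for word, (suffix, pol) in lexicon.items()
    }


# In-memory stand-in for the contents of all_dict.txt
LEXICON = {"great": ("pos", 2), "bad": ("neg", 2)}

features = features_all(["a", "great", "film"], LEXICON)
print(features)
```

With this shape the lexicon would be loaded once before building featuresets, for example lexicon = load_lexicon() followed by features_all(n, lexicon) for each sentence. Single-word checks would then pass a token list such as ['great'] rather than the string 'great', so that training and test inputs are counted the same way.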