Using Naive Bayes in NLTK on movie reviews: improving text-classification accuracy

Date: 2017-12-30 09:57:53

Tags: python-3.x nlp nltk data-science text-classification

I am following http://www.nltk.org/book/ch06.html to build a movie-review classifier. The classifier there treats words of every part of speech (nouns, adjectives, verbs, ...) as part of the feature set. I am trying to build a classifier that considers only verbs and then judges whether a review is positive or negative.

Please explain whether this approach is better and, if so, how it can be improved; otherwise, which other part-of-speech tags need to be included to improve the feature set?
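For reference, the verb-only idea hinges on the Penn Treebank verb tags (VB, VBD, VBG, VBN, VBP, VBZ). A minimal sketch of the filtering step, where the hand-written (word, tag) pairs below are a hypothetical stand-in for real `nltk.pos_tag` output:

```python
# Penn Treebank tags that mark verbs
VERB_TAGS = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}

# Hypothetical (word, tag) pairs standing in for nltk.pos_tag output
tagged = [('the', 'DT'), ('acting', 'NN'), ('was', 'VBD'),
          ('terrific', 'JJ'), ('and', 'CC'), ('kept', 'VBD'),
          ('me', 'PRP'), ('watching', 'VBG')]

# keep only the tokens whose tag is a verb tag
verbs = [word for word, tag in tagged if tag in VERB_TAGS]
print(verbs)  # ['was', 'kept', 'watching']
```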

Please see the code below:

  1. Import nltk and the corpora.
  2. Inspect the categories.
  3. Create a "documents" list in which each document consists of the words not in stopwords.words(), with punctuation filtered out as well.
  4. Shuffle the documents and generate a list named "all_words" containing every word that appears in movie_reviews, minus stopwords and punctuation.
  5. Build a frequency distribution over "all_words".
  6. Create a "verb" list by looking at the pos_tag of each word appearing in "all_words".
  7. For every document in the "documents" list from step 3, build a feature dictionary whose keys correspond to the words in the "verb" list and whose boolean values indicate whether that verb appears in the document. This is handled by documentFeature().
  8. Create a Naive Bayes classifier instance, train it, and compute the test accuracy.
  9. Code:

    from nltk.corpus import movie_reviews
    from nltk.corpus import stopwords
    from nltk.tokenize import RegexpTokenizer
    import nltk
    movie_reviews.categories()
    # ['pos','neg']
    # the RegexpTokenizer splits on \w+, dropping punctuation
    tokenizer=RegexpTokenizer(r'\w+')
    # create the documents: filter stopwords out of each review, then tokenize
    documents=[(tokenizer.tokenize(' '.join(set(i for i in movie_reviews.words(fileid))-set(stopwords.words()))),category) 
               for category in movie_reviews.categories()
               for fileid in movie_reviews.fileids() 
              ]
    import random
    random.shuffle(documents)
    # each document contains only words that are in neither the stopword list nor punctuation
    for i in documents[:5]:
        temp=nltk.FreqDist([j.lower() for j in i[0]])
        print(temp.most_common(5),i[1]) 
    #output of 5 documents
    #[('vampires', 1), ('clever', 1), ('interesting', 1), ('sunlight', 1), ('partners', 1)] neg
    #[('family', 1), ('nino', 1), ('friends', 1), ('acting', 1), ('higher', 1)] pos
    #[('inconsistent', 1), ('eye', 1), ('yes', 1), ('interesting', 1), ('praise', 1)] neg
    #[('acting', 1), ('science', 1), ('bucks', 1), ('huge', 1), ('terrific', 1)] pos
    #[('acting', 1), ('shielded', 1), ('somewhere', 1), ('think', 1), ('touched', 1)] neg
    
    # generate a list 'all_words' containing every distinct word seen so far, minus stopwords
    all_words=tokenizer.tokenize(' '.join(set(i for i in movie_reviews.words())-set(stopwords.words())))
    freqdist=nltk.FreqDist(all_words)
    
    # collect every word in 'all_words' whose POS tag is a verb tag
    verb=[]
    pos_=nltk.pos_tag(all_words)
    #print([i[1] for i in pos_])
    for i in pos_:
        if i[1] in ['VB','VBG','VBN','VBZ','VBD','VBP']:
            verb.append(i[0])
    
    # feature extractor: one boolean 'contains(verb)' feature per verb in the list
    def documentFeature(document):
        feature={}
        for i in verb:
            feature['contains({0})'.format(i)]=(i in document)
        return feature    
    #build a naive bayes classifier
    featureSet=[(documentFeature(d),c) for d,c in documents]
    trainSet,testSet=featureSet[100:], featureSet[:100]
    classifier=nltk.NaiveBayesClassifier.train(trainSet)
    
    print(nltk.classify.accuracy(classifier, testSet))
    # 0.03: a very poor accuracy on the test set
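The feature dictionary that documentFeature() builds can be illustrated on toy data; the verb list and document below are made up for the example:

```python
# Hypothetical verb list, standing in for the one mined from 'all_words'
verb = ['watch', 'enjoy', 'bore']

def document_feature(document, verbs):
    # one boolean feature per verb: is that verb present in the document?
    words = set(document)  # set membership is O(1) per lookup
    return {'contains({0})'.format(v): (v in words) for v in verbs}

doc = ['i', 'enjoy', 'films', 'people', 'watch']
features = document_feature(doc, verb)
print(features)
# {'contains(watch)': True, 'contains(enjoy)': True, 'contains(bore)': False}
```

Converting the document to a set first is a small speed-up over the original `i in document` list scan; the resulting features are identical.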
    

At the moment my accuracy is 0.03; please help me improve it.
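As a side note on the construction above: wrapping the tokens of a review in `set()` collapses duplicates, which is consistent with every count in the printed FreqDist output being 1. A toy sketch with hypothetical tokens:

```python
from collections import Counter

# hypothetical tokens standing in for movie_reviews.words(fileid)
words = ['the', 'acting', 'was', 'great', 'great', 'the', 'plot']
stop = {'the', 'was'}

# set() removes the stopwords but also collapses repeated words...
unique_kept = set(words) - stop
print(sorted(unique_kept))  # ['acting', 'great', 'plot']

# ...so any frequency count over the result can only ever be 1
print(Counter(unique_kept))
```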

0 Answers:

No answers yet.