Using Naive Bayes in NLTK on movie reviews: improving text-classification accuracy

Date: 2017-12-30 09:57:53

Tags: python-3.x nlp nltk data-science text-classification

I am following http://www.nltk.org/book/ch06.html to build a movie-review classifier. The classifier there treats words of every part of speech (nouns, adjectives, verbs, ...) as part of the feature set. I am trying to build a classifier that considers only verbs and then judges whether a review is positive or negative.

Please explain whether this approach is better and, if so, how it can be improved; otherwise, which other part-of-speech tags need to be included to improve the feature set?
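For reference, the verb-only idea hinges on the Penn Treebank verb tags (VB, VBD, VBG, VBN, VBP, VBZ). A minimal sketch of the filtering step, where the hand-written (word, tag) pairs below are a hypothetical stand-in for real `nltk.pos_tag` output:

```python
# Penn Treebank tags that mark verbs
VERB_TAGS = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}

# Hypothetical (word, tag) pairs standing in for nltk.pos_tag output
tagged = [('the', 'DT'), ('acting', 'NN'), ('was', 'VBD'),
          ('terrific', 'JJ'), ('and', 'CC'), ('kept', 'VBD'),
          ('me', 'PRP'), ('watching', 'VBG')]

# keep only the tokens whose tag is a verb tag
verbs = [word for word, tag in tagged if tag in VERB_TAGS]
print(verbs)  # ['was', 'kept', 'watching']
```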

Please see the code below:

  1. Import nltk and the corpora.
  2. Inspect the categories.
  3. Create a "documents" list in which each document consists of the words not in stopwords.words(), with punctuation filtered out as well.
  4. Shuffle the documents and generate a list named "all_words" containing every word that appears in movie_reviews, minus stopwords and punctuation.
  5. Build a frequency distribution over "all_words".
  6. Create a "verb" list by looking at the pos_tag of each word appearing in "all_words".
  7. For every document in the "documents" list from step 3, build a feature dictionary whose keys correspond to the words in the "verb" list and whose boolean values indicate whether that verb appears in the document. This is handled by documentFeature().
  8. Create a Naive Bayes classifier instance, train it, and compute the test accuracy.
  9. Code:

    from nltk.corpus import movie_reviews
    from nltk.corpus import stopwords
    from nltk.tokenize import RegexpTokenizer
    import nltk
    movie_reviews.categories()
    # ['pos','neg']
    # the RegexpTokenizer splits on \w+, dropping punctuation
    tokenizer=RegexpTokenizer(r'\w+')
    # create the documents: filter stopwords out of each review, then tokenize
    documents=[(tokenizer.tokenize(' '.join(set(i for i in movie_reviews.words(fileid))-set(stopwords.words()))),category) 
               for category in movie_reviews.categories()
               for fileid in movie_reviews.fileids() 
              ]
    import random
    random.shuffle(documents)
    # each document contains only words that are in neither the stopword list nor punctuation
    for i in documents[:5]:
        temp=nltk.FreqDist([j.lower() for j in i[0]])
        print(temp.most_common(5),i[1]) 
    #output of 5 documents
    #[('vampires', 1), ('clever', 1), ('interesting', 1), ('sunlight', 1), ('partners', 1)] neg
    #[('family', 1), ('nino', 1), ('friends', 1), ('acting', 1), ('higher', 1)] pos
    #[('inconsistent', 1), ('eye', 1), ('yes', 1), ('interesting', 1), ('praise', 1)] neg
    #[('acting', 1), ('science', 1), ('bucks', 1), ('huge', 1), ('terrific', 1)] pos
    #[('acting', 1), ('shielded', 1), ('somewhere', 1), ('think', 1), ('touched', 1)] neg
    
    # generate a list 'all_words' containing every distinct word seen so far, minus stopwords
    all_words=tokenizer.tokenize(' '.join(set(i for i in movie_reviews.words())-set(stopwords.words())))
    freqdist=nltk.FreqDist(all_words)
    
    # collect every word in 'all_words' whose POS tag is a verb tag
    verb=[]
    pos_=nltk.pos_tag(all_words)
    #print([i[1] for i in pos_])
    for i in pos_:
        if i[1] in ['VB','VBG','VBN','VBZ','VBD','VBP']:
            verb.append(i[0])
    
    # feature extractor: one boolean 'contains(verb)' feature per verb in the list
    def documentFeature(document):
        feature={}
        for i in verb:
            feature['contains({0})'.format(i)]=(i in document)
        return feature    
    #build a naive bayes classifier
    featureSet=[(documentFeature(d),c) for d,c in documents]
    trainSet,testSet=featureSet[100:], featureSet[:100]
    classifier=nltk.NaiveBayesClassifier.train(trainSet)
    
    print(nltk.classify.accuracy(classifier, testSet))
    # 0.03: a very poor accuracy on the test set
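The feature dictionary that documentFeature() builds can be illustrated on toy data; the verb list and document below are made up for the example:

```python
# Hypothetical verb list, standing in for the one mined from 'all_words'
verb = ['watch', 'enjoy', 'bore']

def document_feature(document, verbs):
    # one boolean feature per verb: is that verb present in the document?
    words = set(document)  # set membership is O(1) per lookup
    return {'contains({0})'.format(v): (v in words) for v in verbs}

doc = ['i', 'enjoy', 'films', 'people', 'watch']
features = document_feature(doc, verb)
print(features)
# {'contains(watch)': True, 'contains(enjoy)': True, 'contains(bore)': False}
```

Converting the document to a set first is a small speed-up over the original `i in document` list scan; the resulting features are identical.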
    

At the moment my accuracy is 0.03; please help me improve it.
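As a side note on the construction above: wrapping the tokens of a review in `set()` collapses duplicates, which is consistent with every count in the printed FreqDist output being 1. A toy sketch with hypothetical tokens:

```python
from collections import Counter

# hypothetical tokens standing in for movie_reviews.words(fileid)
words = ['the', 'acting', 'was', 'great', 'great', 'the', 'plot']
stop = {'the', 'was'}

# set() removes the stopwords but also collapses repeated words...
unique_kept = set(words) - stop
print(sorted(unique_kept))  # ['acting', 'great', 'plot']

# ...so any frequency count over the result can only ever be 1
print(Counter(unique_kept))
```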

0 Answers:

No answers yet.