I'm trying to build my own corpus for sentiment analysis of tweets (positive or negative).
As a first step I'm experimenting with the existing NLTK movie reviews corpus. However, when I run this code:
import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk
stop = stopwords.words('english')
# Tokenize each review, drop stopwords and punctuation, and label it with
# the first component of its file id ('pos' or 'neg')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
# Take the first 100 keys of the corpus-wide frequency distribution as features
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
# 90/10 train/test split with word-presence features
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
I get this output:
0.31
Most Informative Features
uplifting = True pos : neg = 5.9 : 1.0
wednesday = True pos : neg = 3.7 : 1.0
controversy = True pos : neg = 3.4 : 1.0
shocks = True pos : neg = 3.0 : 1.0
catchy = True pos : neg = 2.6 : 1.0
instead of the expected output (see Classification using movie review corpus in NLTK/Python):
0.655
Most Informative Features
bad = True neg : pos = 2.0 : 1.0
script = True neg : pos = 1.5 : 1.0
world = True pos : neg = 1.5 : 1.0
nothing = True neg : pos = 1.5 : 1.0
bad = False pos : neg = 1.5 : 1.0
I'm using exactly the same code as that other Stack Overflow page, my NLTK (and theirs) is up to date, and I have the latest movie reviews corpus as well. Does anyone know what is going wrong?
Thanks!
Answer 0 (score: 0)
My guess is that the difference comes from this line:
word_features = word_features.keys()[:100]
word_features is a dict (a Counter, to be more precise), and its keys() method returns the keys in arbitrary order, so the feature list used for your training set is not the same as the one in the original post.
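If the intent is to take the 100 most frequent words (which is what older NLTK versions effectively gave you, since FreqDist.keys() used to return samples sorted by decreasing frequency), a minimal sketch of a deterministic replacement for those two lines could look like this, reusing the documents and FreqDist objects from the code above:

# Assumption: the intended features are the 100 most frequent words in the corpus.
# FreqDist inherits from collections.Counter, so most_common() returns a
# list sorted by decreasing frequency instead of relying on arbitrary key order.
word_features = FreqDist(chain(*[tokens for tokens, tag in documents]))
word_features = [word for word, count in word_features.most_common(100)]

With a frequency-sorted feature list, the features used for training no longer depend on dict ordering, so the accuracy and the most informative features should be reproducible across runs and much closer to the numbers in the original post.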