Using NLTK / NaiveBayesClassifier, always getting a "positive" output on test data

Time: 2018-07-05 18:52:32

Tags: python machine-learning nltk naivebayes

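I recently started getting into machine learning and have been following a tutorial that builds a model to decide whether an input tweet is positive or negative. The program worked, but it was not very accurate, and I did not want to deal with the Twitter API yet, so I tried to convert it to predict the sentiment (positive/negative) of movie reviews instead. I figured that would be easier, and once it worked I could go back to Twitter.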

However, now that I finally have it running, I always get a "positive" result. My training data is a set of 400 positive and 400 negative movie reviews.

This is where I got the dataset (the exact link is the first one under "Sentiment polarity datasets", called "polarity dataset v2.0 (3.0Mb)"): http://www.cs.cornell.edu/people/pabo/movie-review-data/

I am not using all 2000 reviews, only the first 400 positive reviews and the first 400 negative reviews.
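The subset described above could be loaded along these lines. This is only a sketch: `load_reviews` is a hypothetical helper, and the `neg` folder next to `pos` is an assumption about the directory layout, not something stated in the post. The point it illustrates is that each label gets its own glob pattern.

```python
import glob
import os


def load_reviews(pattern, label, limit=400):
    """Return up to `limit` (text, label) pairs from files matching `pattern`."""
    reviews = []
    # sorted() makes "the first 400" deterministic across runs
    for name in sorted(glob.glob(pattern))[:limit]:
        if os.path.isfile(name):  # ignore anything that is not a regular file
            with open(name) as f:
                reviews.append((f.read(), label))
    return reviews


# Assumed layout: a `neg` folder sitting next to the `pos` folder.
base = r"C:\Users\Thomas\tweets"
pos_rev = load_reviews(os.path.join(base, "pos", "*.txt"), "positive")
neg_rev = load_reviews(os.path.join(base, "neg", "*.txt"), "negative")
print(len(pos_rev), len(neg_rev))
```

If both counts print as 400, the two lists were read from separate folders as intended.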

import nltk
import glob
import errno


path = r"C:\Users\Thomas\tweets\pos\*.txt"
files = glob.glob(path)

pos_rev = []
neg_rev = []
for name in files:
    try:
        with open(name) as f:
            content = f.read()
            pos_rev.append((content, 'positive'))

    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise
for name in files:
    try:
        with open(name) as f:
            content = f.read()
            neg_rev.append((content, 'negative'))
    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise
# list of (words, sentiment) pairs for all reviews
reviews = []

# split each review into lower-cased words, dropping words of 2 characters or fewer
# this list is the training set
for (words, sentiment) in pos_rev + neg_rev:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    reviews.append((words_filtered, sentiment))




def get_words_in_reviews(reviews):
    all_words = []
    for (words, sentiment) in reviews:
        all_words.extend(words)
    return all_words
def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

word_features = get_word_features(get_words_in_reviews(reviews))

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

training_set = nltk.classify.apply_features(extract_features, reviews)
classifier = nltk.NaiveBayesClassifier.train(training_set)

review = 'That movie was very bad.  Poor directing, terrible acting and horrible production.'
print(classifier.classify(extract_features(review.split())))

No matter what I feed the classifier, it always returns positive.
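Two things in the code as posted may be worth checking. This is a guess from reading it, not a confirmed diagnosis. First, the second loop iterates over the same `files` list as the first, so `neg_rev` would hold the same texts as `pos_rev` with the opposite label, and the two classes would look identical to the classifier. Second, the test review is split without lower-casing, so tokens such as `That` and `Poor` would not match the lower-cased training words. The helper names below (`tokenize`, `count_shared_texts`) are hypothetical, and the toy data only mimics reading the same files twice.

```python
def tokenize(text, min_len=3):
    """Lower-case and split on whitespace, dropping tokens shorter than min_len."""
    return [w.lower() for w in text.split() if len(w) >= min_len]


def count_shared_texts(pos_rev, neg_rev):
    """Count distinct texts that appear under both labels; ideally this is 0."""
    pos_texts = {text for text, _ in pos_rev}
    neg_texts = {text for text, _ in neg_rev}
    return len(pos_texts & neg_texts)


# Toy data that mimics reading the same files for both labels.
same_files = ["a great film", "a dull film"]
toy_pos = [(t, "positive") for t in same_files]
toy_neg = [(t, "negative") for t in same_files]
shared = count_shared_texts(toy_pos, toy_neg)
print(shared)

# Using one tokenizer for training and prediction keeps the features comparable.
tokens = tokenize("That movie was very bad.  Poor directing")
print(tokens)
```

Running `count_shared_texts` on the real `pos_rev` and `neg_rev` would show whether the two lists overlap. Note that this tokenizer, like the original filter, leaves punctuation attached to words (`bad.`), which is a separate thing one might want to clean up.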

Also, if anyone is still reading at this point, what exactly does this do:

except IOError as exc:
    if exc.errno != errno.EISDIR:
        raise

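I understand that it deals with errors raised when trying to open a file, but is there anything important I should know about IOError, .errno, != and raise? Or is this just a standard except block for reading files?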
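A small sketch of what that `except` block does, under the assumption of Python 3 on a POSIX system. In Python 3, `IOError` is an alias of `OSError`. `exc.errno` holds the operating system's error code, `errno.EISDIR` is the code for "is a directory", and a bare `raise` re-raises the exception currently being handled. So the block quietly skips paths that turn out to be directories and lets every other error propagate. The function name `read_files_skipping_dirs` is hypothetical. On Windows, opening a directory usually fails with a permission error instead, so there the check would probably re-raise rather than skip.

```python
import errno
import os
import tempfile


def read_files_skipping_dirs(paths):
    """Read each path as text; skip directories, re-raise any other OS error."""
    contents = []
    for name in paths:
        try:
            with open(name) as f:
                contents.append(f.read())
        except OSError as exc:  # IOError is an alias of OSError in Python 3
            if exc.errno != errno.EISDIR:
                raise  # not a "this is a directory" error, so pass it on
            # errno == EISDIR: the path is a directory, skip it quietly
    return contents


# Demo (assumes POSIX behaviour, where open() on a directory gives EISDIR).
with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "review.txt")
    with open(file_path, "w") as f:
        f.write("good movie")
    dir_path = os.path.join(tmp, "subdir")
    os.mkdir(dir_path)
    result = read_files_skipping_dirs([file_path, dir_path])
    print(result)
```

Because the glob pattern in the question only matches `*.txt`, a directory would only show up if it happened to be named like a text file, so the check is a defensive habit rather than something every file-reading loop needs.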

Thanks in advance for any help!

0 answers