Python Naive Bayes Classifier Trained on Movie Review Corpus, Tested on Tweets

Time: 2015-12-06 23:46:21

Tags: python twitter nlp nltk naivebayes

import nltk.classify.util
import csv
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# integer division so the cutoffs are valid slice indices in Python 2 and 3
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy: %s' % nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()

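I'm new to Python and am trying to do sentiment analysis on tweets. I'm using the Naive Bayes classifier built into the NLTK package. I'm currently training and testing it on the movie review corpus, and I'd like to test it on tweets that I have stored in a .txt or .csv file using Tweepy. Can anyone help me figure out how to test this classifier on the tweets in my output file? Thanks!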

1 Answer:

Answer 0: (score: 0)

Just load the tweets:

with open('tweets.txt', 'r') as f:
    data = f.readlines()
# one feature dict per tweet, for a file with tweets separated by line
testfeats = [word_feats(tweet.split()) for tweet in data]

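Then you can use your word_feats method to extract the features (you could use CountVectorizer instead). Since the tweets are unlabeled, you can't compute accuracy on them; instead, call classifier.classify(...) on each tweet's features to get a predicted label.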
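To make the steps above concrete, here is a minimal sketch of labelling tweets from a file. It assumes `classifier` is any object with a `classify(featureset)` method, such as the NaiveBayesClassifier trained in the question. The helper names `tweets_from_csv` and `label_tweets` and the "text" column header are hypothetical, not part of NLTK or Tweepy, and lowercasing assumes the training vocabulary is lowercase.

```python
import csv


def word_feats(words):
    # Same bag-of-words featurizer as in the question.
    return dict([(word, True) for word in words])


def tweets_from_csv(fileobj, column="text"):
    """Yield tweet text from CSV rows; the "text" header is an assumption."""
    for row in csv.DictReader(fileobj):
        yield row[column]


def label_tweets(classifier, tweets):
    """Return (tweet, predicted_label) pairs, skipping blank lines."""
    results = []
    for tweet in tweets:
        # Lowercase so tokens can match a lowercase training vocabulary.
        tokens = tweet.lower().split()
        if not tokens:
            continue
        label = classifier.classify(word_feats(tokens))
        results.append((tweet.strip(), label))
    return results
```

With the classifier from the question, something like `label_tweets(classifier, open('tweets.txt'))` would return one prediction per line, and `label_tweets(classifier, tweets_from_csv(open('tweets.csv')))` would do the same for a CSV file. Splitting on whitespace leaves punctuation attached to tokens, so a proper tokenizer would probably improve matches against the training vocabulary.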