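I recently started getting into machine learning and have been following a tutorial that builds a model to decide whether an input tweet is positive or negative. The program ran fine but was not accurate enough, and I don't want to deal with the Twitter API yet, so I tried converting it to predict the sentiment of movie reviews (positive/negative) instead. I figured that would be easier, and once it works I can go back to Twitter.
However, now that I finally have it running, I always get "positive" as the result. My training data is a set of 400 positive and 400 negative movie reviews.
This is where I got the dataset (the exact link is the first one under "Sentiment polarity datasets", called "polarity dataset v2.0 (3.0Mb)"): http://www.cs.cornell.edu/people/pabo/movie-review-data/
I am not using all 2000 reviews, only the first 400 positive reviews and the first 400 negative reviews.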
import nltk
import glob
import errno

path = r"C:\Users\Thomas\tweets\pos\*.txt"
files = glob.glob(path)

pos_rev = []
neg_rev = []

# read every matched file and label it 'positive'
for name in files:
    try:
        with open(name) as f:
            content = f.read()
            pos_rev.append((content, 'positive'))
    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise

# read every matched file and label it 'negative'
for name in files:
    try:
        with open(name) as f:
            content = f.read()
            neg_rev.append((content, 'negative'))
    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise

# create list to store all reviews
reviews = []

# separate reviews into individual lowercase words, dropping words of 2 characters or fewer
# create training set (reviews)
for (words, sentiment) in pos_rev + neg_rev:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    reviews.append((words_filtered, sentiment))

def get_words_in_reviews(reviews):
    # flatten all reviews into one list of words
    all_words = []
    for (words, sentiment) in reviews:
        all_words.extend(words)
    return all_words

def get_word_features(wordlist):
    # the distinct words seen in training become the feature vocabulary
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

word_features = get_word_features(get_words_in_reviews(reviews))

def extract_features(document):
    # one boolean feature per vocabulary word: is it present in this document?
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

training_set = nltk.classify.apply_features(extract_features, reviews)
classifier = nltk.NaiveBayesClassifier.train(training_set)

review = 'That movie was very bad. Poor directing, terrible acting and horrible production.'
print(classifier.classify(extract_features(review.split())))
No matter what I feed the classifier, it always returns positive.
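To rule out a problem with the data itself, a sanity check along these lines might be worth running on the loaded lists before training. This is only a minimal sketch: the helper name `label_summary` is made up for illustration, and the sample data is a tiny stand-in, not the real reviews. It counts how many documents carry each label and how many distinct texts show up under more than one label.

```python
from collections import Counter

def label_summary(labeled_docs):
    """Return (label counts, number of distinct texts seen under more than one label)."""
    counts = Counter(label for _, label in labeled_docs)
    labels_per_text = {}
    for text, label in labeled_docs:
        labels_per_text.setdefault(text, set()).add(label)
    conflicting = sum(1 for labels in labels_per_text.values() if len(labels) > 1)
    return counts, conflicting

# Stand-in data (hypothetical), shaped like the (text, label) tuples in pos_rev + neg_rev.
sample = [
    ("great film", "positive"),
    ("dull plot", "negative"),
    ("great film", "negative"),
]
counts, conflicting = label_summary(sample)
print(counts)
print("texts with conflicting labels:", conflicting)
```

In the script above this would be called as `label_summary(pos_rev + neg_rev)`. If the second number is not zero, the same review text is being fed to the classifier under both labels.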
Also, if anyone is still reading, what exactly does this do:
except IOError as exc:
    if exc.errno != errno.EISDIR:
        raise
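I understand that it handles errors raised while trying to open a file, but is there anything important I should know about IOError, .errno, != and raise? Or is this just the standard except block to use when reading files?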
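To make the question concrete, here is how I currently read that pattern, as a small sketch with simulated errors instead of real files. The helpers `run_guarded` and `fail_with` are hypothetical names introduced only for this illustration. My understanding is that the block swallows exactly one kind of failure (the "is a directory" error code) and re-raises everything else unchanged.

```python
import errno

def run_guarded(action):
    """Run action(); ignore only an 'is a directory' error and re-raise any other I/O error."""
    try:
        return action()
    except IOError as exc:  # in Python 3, IOError is an alias of OSError
        if exc.errno != errno.EISDIR:
            raise  # bare raise re-raises the exception currently being handled
        return None

def fail_with(code):
    """Build a callable that fails with the given errno (simulated, no file access)."""
    def action():
        raise OSError(code, "simulated failure")
    return action

# EISDIR is swallowed, so the call just returns None.
print(run_guarded(fail_with(errno.EISDIR)))

# Any other error code (here: file not found) propagates to the caller.
try:
    run_guarded(fail_with(errno.ENOENT))
except OSError as exc:
    print("re-raised with ENOENT:", exc.errno == errno.ENOENT)
```

If that reading is right, the check only matters when the glob pattern can match directories. One thing I am unsure about is whether opening a directory reports `EISDIR` on every platform.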
Thanks in advance for any help!