Using NLTK for Python

Date: 2015-04-22 14:05:51

Tags: python twitter nlp nltk sentiment-analysis

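I am trying to use NLTK for Python to train on my own corpus for sentiment analysis. I have two text files: one with 25K positive tweets, one tweet per line, and another with 25K negative tweets.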

I am following method 2 from this Stack Overflow article.

When I run this code to create the corpus:

import string
from itertools import chain

from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.corpus import CategorizedPlaintextCorpusReader
import nltk

mydir = 'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews'

mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in documents[numtrain:]]

classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)

I get this error message:

C:\Users\gerbuiker\Anaconda\python.exe "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/CreateCorpus.py"
Traceback (most recent call last):
  File "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/CreateCorpus.py", line 23, in <module>
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
  File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\corpus\reader\util.py", line 336, in iterate_from
    assert self._len is not None
AssertionError

Process finished with exit code 1

Does anyone know how to fix this?

1 answer:

Answer 0 (score: 1)

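I'm not 100% sure, since I'm not on a Windows machine to test this right now, but I think what may be tripping you up is the difference in path slash direction between @alvas's original example and your adaptation for Windows.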

Specifically, you use 'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews', whereas his example uses '/home/alvas/my_movie_reviews'. In most cases that would be fine, but you are trying to reuse his cat_pattern regex, r'(neg|pos)/.*', which will match the forward slashes in his paths but reject the backslashes in yours.
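To make the slash issue concrete, here is a minimal sketch using only the standard `re` module. The file ids `pos/tweets.txt` and `pos\tweets.txt`, the helper `category_of`, and the variant `tolerant_pattern` are hypothetical names for illustration. Using `re.match` as a stand-in for how the reader applies `cat_pattern` is also an assumption, and I have not verified whether NLTK actually reports file ids with backslashes on Windows.

```python
import re

# Category pattern copied from the CategorizedPlaintextCorpusReader call in the question.
cat_pattern = r'(neg|pos)/.*'

# Hypothetical file ids, for illustration only.
posix_style = 'pos/tweets.txt'
windows_style = 'pos\\tweets.txt'


def category_of(fileid, pattern):
    """Return the captured category, or None when the pattern does not match."""
    match = re.match(pattern, fileid)
    return match.group(1) if match else None


# The forward-slash id matches and yields a category.
print(category_of(posix_style, cat_pattern))

# The backslash id does not match, so no category is found.
print(category_of(windows_style, cat_pattern))

# A separator-tolerant variant (my assumption, not part of @alvas's example):
# the character class accepts either a forward slash or a backslash.
tolerant_pattern = r'(neg|pos)[/\\].*'
print(category_of(windows_style, tolerant_pattern))
```

If the file ids on your machine turn out to contain backslashes, a pattern like the tolerant variant might be worth trying; if they already use forward slashes, the cause is probably elsewhere.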