Question

I have two text files: one contains 25K positive tweets, one tweet per line, and the other contains 25K negative tweets, also one per line.

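How can I build a corpus from these two text files so that I can classify a new tweet as positive or negative? I want to use the NLTK module for Python.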

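Edit

The difference from "Using my own corpus instead of movie_reviews corpus for Classification in NLTK" is that my data consists of two text files: one with 25K positive tweets, one tweet per line, and a second with 25K negative tweets, separated the same way.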

The technique described in the question linked above does not work for me.
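For the layout described above (one file per label, one tweet per line), each line can be treated as its own labeled document without going through a corpus reader. The following is only a minimal sketch under stated assumptions, not a tested solution to the problem in this question: the helper names (`load_labeled_tweets`, `bag_of_words`, `build_featuresets`, `train_classifier`) are hypothetical, the files are assumed to be UTF-8, and tokenization is plain whitespace splitting rather than an NLTK tokenizer. The only NLTK calls used are `NaiveBayesClassifier.train` and, in the usage note, `classify`.

```python
import io
import random


def load_labeled_tweets(path, label, encoding="utf-8"):
    """Read one tweet per line and pair each non-empty line with `label`."""
    with io.open(path, encoding=encoding) as handle:
        return [(line.strip(), label) for line in handle if line.strip()]


def bag_of_words(tweet):
    """Turn a tweet into the feature dict form that NLTK classifiers accept.

    Assumption: lowercased whitespace tokens are good enough for a first pass.
    """
    return {token: True for token in tweet.lower().split()}


def build_featuresets(pos_path, neg_path, seed=0):
    """Combine both files into a shuffled list of (features, label) pairs."""
    labeled = (load_labeled_tweets(pos_path, "pos")
               + load_labeled_tweets(neg_path, "neg"))
    # Shuffle with a fixed seed so a later train/test split is reproducible.
    random.Random(seed).shuffle(labeled)
    return [(bag_of_words(text), label) for text, label in labeled]


def train_classifier(featuresets):
    """Train a Naive Bayes classifier on (features, label) pairs."""
    # Imported here so the loading helpers above also work without NLTK.
    from nltk.classify import NaiveBayesClassifier
    return NaiveBayesClassifier.train(featuresets)
```

With hypothetical file names, usage might look like `classifier = train_classifier(build_featuresets("positive.txt", "negative.txt"))`, followed by `classifier.classify(bag_of_words("what a great day"))` to label a new tweet. Holding back part of the shuffled feature sets for evaluation would be a natural next step.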

When I run this code:

import string
from nltk.corpus import stopwords
from nltk.corpus import CategorizedPlaintextCorpusReader
import traceback
import sys

try:
    mr = CategorizedPlaintextCorpusReader('C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews', r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
    stop = stopwords.words('english')
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

    for doc in documents:
        print doc
except Exception, err:
    print traceback.format_exc()
    #or
    print sys.exc_info()[0]

I get this error message:

C:\Users\gerbuiker\Anaconda\python.exe "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/haha.py"
    Traceback (most recent call last):
      File "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/haha.py", line 17, in <module>
        documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
      File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\corpus\reader\util.py", line 336, in iterate_from
        assert self._len is not None
    AssertionError

    <type 'exceptions.AssertionError'>

Does anyone know how to fix this?

Training a corpus of negative and positive tweets for sentiment analysis using NLTK for Python

0 answers: