Question

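I am trying to load the 20newsgroups corpus with an NLTK corpus reader, then extract the words from every document and label them. But when I try to build the list of extracted words and labels, I get an error.

Here is the code: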

import nltk
import random

from nltk.tokenize import word_tokenize

newsgroups = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"C:\nltk_data\corpora\20newsgroups",
    r'(?!\.).*\.txt', 
    cat_pattern=r'(not_sports|sports)/.*',
    encoding="utf8")

documents = [(list(newsgroups.words(fileid)), category)
             for category in newsgroups.categories()
             for fileid in newsgroups.fileids(category)]

random.shuffle(documents)

The corresponding error is:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-10-de2a1a6859ea> in <module>()
      1 documents = [(list(newsgroups.words(fileid)), category)
----> 2              for category in newsgroups.categories()
      3              for fileid in newsgroups.fileids(category)]
      4 
      5 random.shuffle(documents)

<ipython-input-10-de2a1a6859ea> in <listcomp>(.0)
      1 documents = [(list(newsgroups.words(fileid)), category)
      2              for category in newsgroups.categories()
----> 3              for fileid in newsgroups.fileids(category)]
      4 
      5 random.shuffle(documents)

C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in __len__(self)
    231             # iterate_from() sets self._len when it reaches the end
    232             # of the file:
--> 233             for tok in self.iterate_from(self._toknum[-1]): pass
    234         return self._len
    235 

C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in iterate_from(self, start_tok)
    294             self._current_toknum = toknum
    295             self._current_blocknum = block_index
--> 296             tokens = self.read_block(self._stream)
    297             assert isinstance(tokens, (tuple, list, AbstractLazySequence)), (
    298                 'block reader %s() should return list or tuple.' %

C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\plaintext.py in _read_word_block(self, stream)
    120         words = []
    121         for i in range(20): # Read 20 lines at a time.
--> 122             words.extend(self._word_tokenizer.tokenize(stream.readline()))
    123         return words
    124 

C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in readline(self, size)
   1166         while True:
   1167             startpos = self.stream.tell() - len(self.bytebuffer)
-> 1168             new_chars = self._read(readsize)
   1169 
   1170             # If we're at a '\r', then read one extra character, since

C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in _read(self, size)
   1398 
   1399         # Decode the bytes into unicode characters
-> 1400         chars, bytes_decoded = self._incr_decode(bytes)
   1401 
   1402         # If we got bytes but couldn't decode any, then read further.

C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in _incr_decode(self, bytes)
   1429         while True:
   1430             try:
-> 1431                 return self.decode(bytes, 'strict')
   1432             except UnicodeDecodeError as exc:
   1433                 # If the exception occurs at the end of the string,

C:\ProgramData\Anaconda3\lib\encodings\utf_8.py in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 6: invalid start byte

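I tried changing the encoding in the corpus reader to ascii and utf16; neither worked. I am also not sure whether the regular expression I supplied is correct. File names in the 20newsgroups corpus consist of two numbers separated by a hyphen (-), for example: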

5-53286

102-53553

8642-104983

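The second thing I am worried about is whether the error comes from the document contents when they are read for feature extraction. Here is what documents in the 20newsgroups corpus look like: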

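From: bil@okcforum.osrhe.edu (Bill Conner) Subject: Re: free moral agency

dean.kaflowitz (decay@cbnewsj.cb.att.com) wrote: : > : > I think you're letting atheist mythology

: Great start. I realize immediately that you are not interested : in discussion and are going to thump your babble at me. I would : much prefer an answer from Ms Healy, who seems to have a : reasonable and reasoned approach to things. Say, aren't you the : creationist guy who made a lot of silly statements about : evolution some time ago?

: Duh, gee, then we must be talking Christian mythology now. I was hoping to discuss something with a reasonable, logical person, but all you seem to have for your side is a repetition of the : same boring mythology I've seen a thousand times before. : I am deleting the rest of your remarks, unless I spot something that : approaches an answer, because they are merely a repetition of : some uninteresting doctrine or other and contain no thought at all.

: I have to congratulate you, though, Bill. You wouldn't : know a logical argument if it bit you on the balls. Such a : persistent lack of function in the face of repeated : attempts to assist you in learning (which I have seen : in this forum and others in the past) speaks of a talent : that goes well beyond my own meager abilities. I just don't : seem to have that capacity for ignoring outside influences.

: Dean Kaflowitz

Dean,

Re-read your comments, do you think that merely characterizing an argument is the same as refuting it? Do you think that ad hominem attacks are sufficient to make any point other than your disapproval of me? Do you have any contribution to make at all?

Bill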

From: cmk@athena.mit.edu (Charles M Kozierok) Subject: Re: Jack Morris

In article <1993Apr19.024222.11181@newshub.ariel.yorku.ca> cs902043@ariel.yorku.ca (SHAWN LUDDINGTON) writes: } In article <1993Apr18.032345.5178@cs.cornell.edu> tedward@cs.cornell.edu (Edward [Ted] Fischer) writes: } >In article <1993Apr18.030412.1210@mnemosyne.cs.du.edu> gspira@nyx.cs.du.edu (Greg Spira) writes: } >>Howard_Wong@mindlink.bc.ca (Howard Wong) writes: }
>> } >>>Has Jack lost a bit of his edge? What is the worst start Jack Morris has had? } >> } >>Uh, Jack lost his edge about 5 years ago, and has had only one above } >>average year in the last 5. } > } >Again goes to prove that it is better to be good than lucky.  You can }
>count on good tomorrow.  Lucky seems to be prone to bad starts (and a } >bad finish last year :-). } > } >(Yes, I am enjoying every last run he gives up.  Who was it who said } >Morris was a better signing than Viola?) }  } Hey Valentine, I don't see Boston with any world series rings on their } fingers.

oooooo. cheap shot. :^)

} Damn, Morris now has three and probably the Hall of Fame in his  } future.

who cares? he had two of them before he came to Toronto; and if the Jays had signed Viola instead of Morris, it would have been Frank who won 20 and got the ring. and he would be on his way to 20 this year, too.

} Therefore, I would have to say Toronto easily made the best  } signing.

your logic is curious, and spurious.

there is no reason to believe that Viola wouldn't have won as many games had *he* signed with Toronto. when you compare their stupid W-L records, be sure to compare their team's offensive averages too.

now, looking at anything like the Morris-Viola sweepstakes a year later is basically hindsight. but there were plenty of reasons why it should have been apparent that Viola was the better pitcher, based on previous recent years and also based on age (Frank is almost 5 years younger! how many knew that?). people got caught up in the '91 World Series, and then on Morris' 21 wins last year. wins are the stupidest, most misleading statistic in baseball, far worse than RBI or R. that he won 21 just means that the Jays got him a lot of runs.

the only really valid retort to Valentine is: weren't the Red Sox trying to get Morris too? oh, sure, they *said* Viola was their first choice afterwards, but what should we have expected they would say?

} And don't tell me Boston will win this year.  They won't  } even be in the top 4 in the division, more like 6th.

if this is true, it won't be for lack of contribution by Viola, so who cares?

-*- charles

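Please tell me whether I am going wrong when loading the documents, or when reading the files and extracting the words. What do I need to do to load the corpus correctly?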

Answer 1

NLTK has trouble loading this corpus.
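Two details of the reader setup in the question can be checked without NLTK at all. The sketch below is only an illustration under stated assumptions: the `sports/` and `not_sports/` prefixes are assumed relative paths, `relaxed_pattern` is a hypothetical replacement pattern, and `raw` is a made-up byte string that places 0xa0 at position 6 as in the traceback. It also assumes the reader matches the pattern against the whole relative path, which is why `re.fullmatch` is used.

```python
import re

# File names quoted in the question (no ".txt" extension).
# The directory prefixes are assumed relative paths, for illustration only.
fileids = ["sports/5-53286", "sports/102-53553", "not_sports/8642-104983"]

original_pattern = r"(?!\.).*\.txt"    # pattern used in the question
relaxed_pattern = r"(?!\.).*\d+-\d+"   # hypothetical replacement pattern

# Which of the quoted names does each pattern accept in full?
matched_original = [f for f in fileids if re.fullmatch(original_pattern, f)]
matched_relaxed = [f for f in fileids if re.fullmatch(relaxed_pattern, f)]

# Byte 0xa0 from the traceback: not a valid UTF-8 start byte,
# but a non-breaking space when decoded as Latin-1.
raw = b"From: \xa0someone"
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
latin1_text = raw.decode("latin-1")

print(matched_original)
print(matched_relaxed)
print(utf8_ok, repr(latin1_text))
```

If the files on disk really have no `.txt` extension, the original pattern would select none of them, so it is worth checking what `newsgroups.fileids()` returns. For the decode error itself, passing `encoding="latin-1"` to the reader may get further than `utf8`, since every byte value is valid in Latin-1; whether that is the right encoding for every file in the corpus is an assumption.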

You can load the data for the categories you need with:

from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

newsgroups_train.target_names gives you the categories.
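To get the same `(words, category)` list the question was building, the texts and targets can be paired up by hand. This is a minimal sketch, not the only way to do it: `build_documents` is a hypothetical helper, the tokenizer is plain `str.split()` rather than an NLTK tokenizer, and the sample data is a tiny stand-in so the block runs without downloading anything.

```python
import random


def build_documents(texts, targets, target_names):
    """Pair each text's whitespace-separated tokens with its category name."""
    return [(text.split(), target_names[label])
            for text, label in zip(texts, targets)]


# Tiny stand-in data (assumed, for illustration) so the sketch runs offline.
sample_texts = ["Re: free moral agency", "Re: orbit launch window"]
sample_targets = [0, 1]
sample_names = ["alt.atheism", "sci.space"]

documents = build_documents(sample_texts, sample_targets, sample_names)
random.shuffle(documents)
print(documents)
```

With the real data, the call would be `build_documents(newsgroups_train.data, newsgroups_train.target, newsgroups_train.target_names)`. Swapping `text.split()` for `nltk.word_tokenize(text)` should give tokens closer to what the NLTK reader produces, provided the tokenizer models are installed.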
