Handling pickle files in NLTK corpora

Time: 2018-03-13 10:45:00

Tags: python python-3.x nltk pickle

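I am working through the nltk tutorial.

It requires downloading various corpora. I had a problem downloading the NLTK corpora with nltk.download() (the problem I was facing), and after none of the suggested solutions worked, I followed the steps described here.

I started downloading the corpora needed by each example from this page and placed them in the directory D:\nltk_data\corpora. I was able to try out various examples. But in one example I got this error: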

 Resource punkt not found.
 Please use the NLTK Downloader to obtain the resource:

 >>> import nltk
 >>> nltk.download('punkt')

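So I downloaded punkt from the same page and copied it into the same directory, but it did not work. I also tried from nltk.corpus import punkt, as with the other corpora. That did not work either; it says Unresolved import: punkt

One difference between punkt and the other corpora is that it contains pickle files rather than text files. How can I solve this?

Code: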
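A note on where the file is expected: punkt is a sentence tokenizer model rather than a corpus, which is why from nltk.corpus import punkt cannot be resolved. NLTK looks for it under tokenizers/punkt inside an nltk_data folder, not under corpora. The sketch below is one possible way to repair a manual download, assuming the punkt folder was unpacked to corpora/punkt under the nltk_data root as described above. relocate_punkt is a hypothetical helper name and is not part of NLTK.

```python
import os
import shutil


def relocate_punkt(nltk_data_root):
    """Move a misplaced corpora/punkt folder to tokenizers/punkt.

    Returns the tokenizers/punkt path if it exists afterwards, else None.
    Assumption: the manual download was unpacked to <root>/corpora/punkt.
    """
    misplaced = os.path.join(nltk_data_root, "corpora", "punkt")
    expected = os.path.join(nltk_data_root, "tokenizers", "punkt")

    # Nothing to do if the folder is already where NLTK searches for it.
    if os.path.isdir(expected):
        return expected

    # Otherwise move the manually downloaded folder into tokenizers/.
    if os.path.isdir(misplaced):
        os.makedirs(os.path.dirname(expected), exist_ok=True)
        shutil.move(misplaced, expected)
        return expected

    # punkt was not found in either location under this root.
    return None
```

For the setup in the question this would be called as relocate_punkt(r"D:\nltk_data"). After that, gutenberg.sents(fileid) should be able to load the tokenizer, provided the downloaded folder contains the pickle files that the installed NLTK version expects.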

import nltk

from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid)) 
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

Error:

Traceback (most recent call last):
  File "D:\Mahesh\workspaces\pyworkspace\nltkdemo\chp2\chp2.py", line 8, in <module>
    num_sents = len(gutenberg.sents(fileid))
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\corpus\reader\util.py", line 233, in __len__
    for tok in self.iterate_from(self._toknum[-1]): pass
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\corpus\reader\util.py", line 296, in iterate_from
    tokens = self.read_block(self._stream)
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\corpus\reader\plaintext.py", line 129, in _read_sent_block
    for sent in self._sent_tokenizer.tokenize(para)])
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\data.py", line 984, in __getattr__
    self.__load()
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\data.py", line 976, in __load
    resource = load(self._path)
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\data.py", line 836, in load
    opened_resource = _open(resource_url)
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\data.py", line 954, in _open
    return find(path_, path + ['']).open()
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\data.py", line 675, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  Searched in:
    - 'C:\\Users\\593932/nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'D:\\Softwares\\python\\WinPython-64bit-3.4.4.4Qt5\\python-3.4.4.amd64\\nltk_data'
    - 'D:\\Softwares\\python\\WinPython-64bit-3.4.4.4Qt5\\python-3.4.4.amd64\\share\\nltk_data'
    - 'D:\\Softwares\\python\\WinPython-64bit-3.4.4.4Qt5\\python-3.4.4.amd64\\lib\\nltk_data'
    - 'C:\\Users\\Mahesha999\\AppData\\Roaming\\nltk_data'
    - ''
**********************************************************************

The error seems to occur on line 8: num_sents = len(gutenberg.sents(fileid))
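The "Searched in" list in the error message gives the nltk_data roots that were tried; the resource path (for punkt, a path under tokenizers/punkt) is looked up below each of them. In NLTK the same list is available as nltk.data.path. As a rough diagnostic, the sketch below checks a list of such roots for a tokenizers/punkt folder using only the standard library. find_punkt_roots is a hypothetical helper name, and the check only tests that the folder exists, not that its contents are valid.

```python
import os


def find_punkt_roots(search_roots):
    """Return the roots that contain a tokenizers/punkt folder."""
    hits = []
    for root in search_roots:
        # The error message lists an empty entry (''); skip it.
        if not root:
            continue
        if os.path.isdir(os.path.join(root, "tokenizers", "punkt")):
            hits.append(root)
    return hits
```

With the roots from the traceback, for example find_punkt_roots(['C:\\nltk_data', 'D:\\nltk_data', '']), an empty result would mean that punkt is not in a location NLTK searches, even if a copy exists under corpora.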

0 answers:

No answers