I can't get the paras and sents functions of PlaintextCorpusReader to work. Here is my code:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = './dir_root'
newcorpus = PlaintextCorpusReader(corpus_root, '.*') # Files you want to add
word_list = newcorpus.words('file1.txt')
sentence_list = newcorpus.sents('file1.txt')
paragraph_list = newcorpus.paras('file1.txt')
print(word_list)
print(sentence_list)
print(paragraph_list)
word_list works fine:
['__________________________________________________________________', 'Title', ...]
However, both paragraph_list and sentence_list raise this error:
Traceback (most recent call last):
File "corpus.py", line 13, in <module>
print(sentence_list)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/collections.py", line 225, in __repr__
for elt in self:
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 296, in iterate_from
tokens = self.read_block(self._stream)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/plaintext.py", line 129, in _read_sent_block
for sent in self._sent_tokenizer.tokenize(para)])
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 956, in __getattr__
self.__load()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 948, in __load
resource = load(self._path)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 808, in load
opened_resource = _open(resource_url)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 926, in _open
return find(path_, path + ['']).open()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 648, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource 'tokenizers/punkt/PY3/english.pickle' not found.
Please use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/Users/username/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
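I tried using nltk.download() to download the file into the corpus, but that didn't help either. It also doesn't seem like a download should be necessary, since the paras and sents functions are already part of PlaintextCorpusReader. Is there a specific fileid I need to pass? Or is some kind of regular-expression argument required to find sentences or paragraphs? The documentation and source code don't seem to indicate that these functions need anything more than the words function does.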
Answer 0: (score: 4)
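You are missing the data file (the "resource") that the sentence tokenizer needs. words() only uses the word tokenizer, which needs no data file, whereas sents() and paras() load the "punkt" sentence tokenizer model. Fix the problem by downloading the "punkt" resource under "Models" in the interactive downloader, or by running this code once, non-interactively: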
nltk.download("punkt")
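To avoid running into this kind of problem again and again while you explore nltk, I suggest downloading the "book" bundle now. It contains everything you are likely to need for a while.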
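If you would rather not call the downloader on every run, the check can be made conditional. The following is only a minimal sketch: ensure_nltk_resource is a hypothetical helper name, not part of nltk, and the find/download parameters exist only so the helper can be tried without network access. By default it relies on nltk.data.find (which raises LookupError when a resource is missing) and nltk.download. The resource path and package name are taken from the error message above; if your nltk version reports a different resource name in its LookupError, use that one instead.

```python
def ensure_nltk_resource(resource_path, package_id, find=None, download=None):
    """Download package_id only if resource_path cannot be found.

    Hypothetical helper (not an nltk API). find and download default to
    nltk.data.find and nltk.download; they are parameters only so the
    helper can be exercised without nltk data or a network connection.
    Returns True if a download was triggered, False otherwise.
    """
    if find is None or download is None:
        import nltk  # imported lazily, only when the defaults are needed
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        # nltk.data.find raises LookupError when the resource is absent
        find(resource_path)
    except LookupError:
        download(package_id)
        return True
    return False


# Intended use with the code from the question (assumes nltk is installed):
#   ensure_nltk_resource("tokenizers/punkt", "punkt")
#   sentence_list = newcorpus.sents('file1.txt')
```

Call the helper once before the first sents() or paras() call. On later runs the resource is found locally and nothing is downloaded.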