I imported the nltk package. I need to use nltk.sent_tokenize and nltk.word_tokenize, but whenever I call them I get the following error:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, 10.0.0.4): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/hdp/current/spark-client/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/usr/hdp/current/spark-client/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/hdp/current/spark-client/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<stdin>", line 2, in <lambda>
File "/usr/bin/anaconda/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 85, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "/usr/bin/anaconda/lib/python2.7/site-packages/nltk/data.py", line 781, in load
opened_resource = _open(resource_url)
File "/usr/bin/anaconda/lib/python2.7/site-packages/nltk/data.py", line 895, in _open
return find(path_, path + ['']).open()
File "/usr/bin/anaconda/lib/python2.7/site-packages/nltk/data.py", line 624, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/home/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''
**********************************************************************
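I have gone through many posts on this topic and tried nltk.download('all'), the -d option, and arranging the subfolders by hand.
Azure ML's Python does not ship with the nltk library. There should be some other way to use nltk on this platform. Please help! Thanks!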
Answer 0 (score: 0)
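Based on the error message, I think the problem has one of the following two causes, as @alvas said:
1. The NLTK resource has not been downloaded.
2. The resource directory is not in the list `nltk.data.path`.
I followed the comments posted by @alvas and went through the steps below.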
# 1. Installation for package `nltk` using `pip`
~ $ pip install nltk
# 2. Download resource in Python REPL.
~ $ python
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
Identifier> punkt
Downloading package punkt to /home/<username>/nltk_data...
Unzipping tokenizers/punkt.zip.
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> q
True
>>>
# 3. Check the resource path listed in the path list `nltk.data.path`
>>> nltk.data.path
['/home/<username>/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']
~ $ pwd && ls nltk_data
/home/<username>
tokenizers
# 4. If the dir `nltk_data` is not in the path list `nltk.data.path`,
# you can move the dir to one of the paths in that list (this may need root privileges),
~ $ mv nltk_data /usr/local/share/
# or append the path of the dir `nltk_data` into the list variable `nltk.data.path`.
>>> nltk.data.path.append('/home/<username>/nltk_data/')
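The check in steps 3 and 4 can also be done in code. Below is a minimal sketch that uses only the standard library. `find_resource_dirs` is a hypothetical helper name of my own, not part of NLTK, and it only looks for an unzipped `tokenizers/punkt` directory (it does not inspect zip archives), so treat it as a rough diagnostic rather than a copy of NLTK's own lookup logic.

```python
import os


def find_resource_dirs(search_paths, resource="tokenizers/punkt"):
    """Return the entries of search_paths that contain the resource directory.

    Hypothetical helper for illustration; NLTK itself performs a more
    elaborate lookup.
    """
    hits = []
    for base in search_paths:
        if not base:
            # Skip empty entries such as the u'' shown in the error message.
            continue
        candidate = os.path.join(base, *resource.split("/"))
        if os.path.isdir(candidate):
            hits.append(base)
    return hits


# Intended usage, assuming nltk is installed:
#   import nltk
#   print(find_resource_dirs(nltk.data.path))
# An empty list suggests that none of the searched directories holds punkt.
```

Since the traceback in the question comes from a Spark worker (10.0.0.4), the result is probably most informative when the check runs on that node rather than only on the machine where `nltk.download()` was called.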
Hope it helps.