Question

I am trying to use NLTK's sent_tokenize(), so I downloaded the following:

import nltk
nltk.download("stopwords")
nltk.download("punkt")

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

# split the text into sentences (data holds the Russian text to process)
sentences = sent_tokenize(data, language="russian")

But it returns:

LookupError: 
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

  import nltk
  nltk.download('punkt')

  Searched in:
- '/Users/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/Library/Frameworks/Python.framework/Versions/3.6/nltk_data'
- '/Library/Frameworks/Python.framework/Versions/3.6/share/nltk_data'
- '/Library/Frameworks/Python.framework/Versions/3.6/lib/nltk_data'

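But I don't understand why, since I have already downloaded it. I also tried running nltk.download(), but I don't have much RAM, so it runs too slowly. What should I change to fix this?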

Answer 1

You can try:

nltk.download("popular")

This downloads NLTK's most commonly used packages, including tokenizers and stopwords.
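
If the error persists after downloading, one possible cause is that the data was saved to a directory that is not in the "Searched in" list from the traceback. The following is a minimal sketch for checking this, not a definitive fix. The helper names locate_resource and download_and_register are hypothetical and exist only for illustration; nltk.download(..., download_dir=...) and nltk.data.path are the actual NLTK calls used. The resource name is a parameter because recent NLTK releases look for punkt_tab rather than punkt.

```python
from pathlib import Path


def locate_resource(resource, search_paths):
    """Return the search paths under which `resource` is present.

    Hypothetical helper: it assumes a resource such as "tokenizers/punkt"
    lives below a data path either as an unpacked directory or as a
    .zip archive with the same name.
    """
    hits = []
    for base in search_paths:
        candidate = Path(base) / resource
        archive = Path(str(candidate) + ".zip")
        if candidate.exists() or archive.exists():
            hits.append(str(base))
    return hits


def download_and_register(package, target_dir):
    """Download `package` into `target_dir` and make NLTK search there.

    Hypothetical helper: `target_dir` is any writable directory you choose.
    """
    import nltk  # imported lazily so locate_resource works without NLTK

    nltk.download(package, download_dir=target_dir)
    # nltk.data.path is the list of directories NLTK searches for data
    if target_dir not in nltk.data.path:
        nltk.data.path.append(target_dir)
```

For example, after calling download_and_register("punkt", some_dir), the call locate_resource("tokenizers/punkt", nltk.data.path) should list some_dir. An empty list means the resource is still outside every directory NLTK searches, which would explain the LookupError.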
