I'm busy writing a model to predict the type of text in PDF documents, e.g. names or dates.
The model uses nltk.word_tokenize and nltk.pos_tag.
When I try to run it on Kubernetes on Google Cloud Platform, I get the following error:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('x')
tagged_word = pos_tag(['x'])
stacktrace:
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
Searched in:
- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/env/nltk_data'
- '/env/share/nltk_data'
- '/env/lib/nltk_data'
- ''
But obviously, downloading the data to my local machine doesn't solve anything if this has to run on Kubernetes, and we haven't set up an NFS for the project yet.
Answer 0 (Score: 2)
The way I eventually solved this was to add the download of the nltk packages in the init function:
import logging

import nltk
from nltk import word_tokenize, pos_tag

LOGGER = logging.getLogger(__name__)
LOGGER.info('Catching broad nltk errors')

# Download into a directory that is already on NLTK's default search path.
DOWNLOAD_DIR = '/usr/lib/nltk_data'
LOGGER.info(f'Saving files to {DOWNLOAD_DIR}')

# If tokenization fails, the 'punkt' resource is missing, so download it.
try:
    tokenized = word_tokenize('x')
    LOGGER.info(f'Tokenized word: {tokenized}')
except Exception as err:
    LOGGER.info(f'NLTK dependencies not downloaded: {err}')
    try:
        nltk.download('punkt', download_dir=DOWNLOAD_DIR)
    except Exception as e:
        LOGGER.info(f'Error occurred while downloading file: {e}')

# Likewise, if tagging fails, download the 'averaged_perceptron_tagger' resource.
try:
    tagged_word = pos_tag(['x'])
    LOGGER.info(f'Tagged word: {tagged_word}')
except Exception as err:
    LOGGER.info(f'NLTK dependencies not downloaded: {err}')
    try:
        nltk.download('averaged_perceptron_tagger', download_dir=DOWNLOAD_DIR)
    except Exception as e:
        LOGGER.info(f'Error occurred while downloading file: {e}')
I realise that the mass of try/except blocks is not strictly necessary. I also specified the download directory, because if you don't, nltk seems to download and unzip the tagger into /usr/lib, and nltk does not look for the files there.
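If you want to check which directories NLTK actually searches at runtime, you can inspect nltk.data.path; a minimal sketch (the appended custom path is a hypothetical example, not part of the original answer):

import nltk

# nltk.data.path is the list of directories NLTK searches for resources;
# the traceback above lists exactly these entries.
print(nltk.data.path)

# If you download to a location that is not in the list, append it at runtime.
nltk.data.path.append('/some/custom/nltk_data')  # hypothetical path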
This downloads the files on the first run on each new pod, and the files remain there until the pod dies.
This solved the error on a stateless Kubernetes workload, which means it also works for non-persistent environments (such as App Engine), but it is not the most efficient approach, because the data has to be downloaded every time a new instance spins up.
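One way to avoid the per-instance download (not part of the original answer, just a sketch) is to fetch the NLTK data once while the container image is built, so every pod starts with the resources already in place. For example, a small Python script run during the image build (the filename and the build-step wiring are assumptions):

# download_nltk_data.py -- hypothetical build-time script, e.g. invoked from a
# Dockerfile RUN step so the data is baked into the image.
import nltk

DOWNLOAD_DIR = '/usr/lib/nltk_data'  # a directory NLTK searches by default

for resource in ('punkt', 'averaged_perceptron_tagger'):
    nltk.download(resource, download_dir=DOWNLOAD_DIR)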