I'm busy writing a model to predict the type of text in PDF documents, e.g. names or dates.
The model uses nltk.word_tokenize and nltk.pos_tag.
When I try to run it on Kubernetes on Google Cloud Platform, I get the following error:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('x')
tagged_word = pos_tag(['x'])
stacktrace:
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
Searched in:
- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/env/nltk_data'
- '/env/share/nltk_data'
- '/env/lib/nltk_data'
- ''
But obviously, downloading the data to my local machine doesn't solve anything if this has to run on Kubernetes, and we haven't set up an NFS for the project yet.
Answer 0 (Score: 2)
The way I eventually solved this was to add the download of the nltk packages in the init function:
import logging

import nltk
from nltk import word_tokenize, pos_tag

LOGGER = logging.getLogger(__name__)
LOGGER.info('Catching broad nltk errors')

# Download into a directory that is already on NLTK's default search path.
DOWNLOAD_DIR = '/usr/lib/nltk_data'
LOGGER.info(f'Saving files to {DOWNLOAD_DIR}')

# If tokenization fails, the 'punkt' resource is missing, so download it.
try:
    tokenized = word_tokenize('x')
    LOGGER.info(f'Tokenized word: {tokenized}')
except Exception as err:
    LOGGER.info(f'NLTK dependencies not downloaded: {err}')
    try:
        nltk.download('punkt', download_dir=DOWNLOAD_DIR)
    except Exception as e:
        LOGGER.info(f'Error occurred while downloading file: {e}')

# Likewise, if tagging fails, download the 'averaged_perceptron_tagger' resource.
try:
    tagged_word = pos_tag(['x'])
    LOGGER.info(f'Tagged word: {tagged_word}')
except Exception as err:
    LOGGER.info(f'NLTK dependencies not downloaded: {err}')
    try:
        nltk.download('averaged_perceptron_tagger', download_dir=DOWNLOAD_DIR)
    except Exception as e:
        LOGGER.info(f'Error occurred while downloading file: {e}')
I realise that the mass of try/except blocks is not strictly necessary. I also specified the download directory, because if you don't, nltk seems to download and unzip the tagger into /usr/lib, and nltk does not look for the files there.
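If you want to check which directories NLTK actually searches at runtime, you can inspect nltk.data.path; a minimal sketch (the appended custom path is a hypothetical example, not part of the original answer):

import nltk

# nltk.data.path is the list of directories NLTK searches for resources;
# the traceback above lists exactly these entries.
print(nltk.data.path)

# If you download to a location that is not in the list, append it at runtime.
nltk.data.path.append('/some/custom/nltk_data')  # hypothetical path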
This downloads the files on the first run on each new pod, and the files remain there until the pod dies.
This solved the error on a stateless Kubernetes workload, which means it also works for non-persistent environments (such as App Engine), but it is not the most efficient approach, because the data has to be downloaded every time a new instance spins up.
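One way to avoid the per-instance download (not part of the original answer, just a sketch) is to fetch the NLTK data once while the container image is built, so every pod starts with the resources already in place. For example, a small Python script run during the image build (the filename and the build-step wiring are assumptions):

# download_nltk_data.py -- hypothetical build-time script, e.g. invoked from a
# Dockerfile RUN step so the data is baked into the image.
import nltk

DOWNLOAD_DIR = '/usr/lib/nltk_data'  # a directory NLTK searches by default

for resource in ('punkt', 'averaged_perceptron_tagger'):
    nltk.download(resource, download_dir=DOWNLOAD_DIR)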