Question

I have to deploy a custom Keras model in AWS SageMaker. I have created a notebook instance and have the following files:

AmazonSagemaker-Codeset16
   -ann
      -nginx.conf
      -predictor.py
      -serve
      -train.py
      -wsgi.py
   -Dockerfile

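I open the AWS terminal, build the Docker image, and push it to an ECR repository. Then I open a new Jupyter Python notebook and try to fit the model and deploy it. Training completes correctly, but on deployment I get the following error: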

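"Error hosting endpoint sagemaker-example-2019-10-25-06-11-22-366: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..."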

When I check the logs, I find the following:

2019/11/11 11:53:32 [crit] 19#19: *3 connect() to unix:/tmp/gunicorn.sock failed (2: No such file or directory) while connecting to upstream, client: 10.32.0.4, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock:/ping", host: "model.aws.local:8080"

and

Traceback (most recent call last):
  File "/usr/local/bin/serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/sagemaker_containers/cli/serve.py", line 19, in main
    server.start(env.ServingEnv().framework_module)
  File "/usr/local/lib/python2.7/dist-packages/sagemaker_containers/_server.py", line 107, in start
    module_app,
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1343, in _execute_child
    raise child_exception

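I tried deploying the same model to AWS SageMaker with these files from my local machine, and it deployed successfully. From inside AWS, however, I run into this problem.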

Here is the code of my serve file:

from __future__ import print_function
import multiprocessing
import os
import signal
import subprocess
import sys

cpu_count = multiprocessing.cpu_count()

model_server_timeout = os.environ.get('MODEL_SERVER_TIMEOUT', 60)
model_server_workers = int(os.environ.get('MODEL_SERVER_WORKERS', cpu_count))


def sigterm_handler(nginx_pid, gunicorn_pid):
    try:
        os.kill(nginx_pid, signal.SIGQUIT)
    except OSError:
        pass
    try:
        os.kill(gunicorn_pid, signal.SIGTERM)
    except OSError:
        pass

    sys.exit(0)


def start_server():
    print('Starting the inference server with {} workers.'.format(model_server_workers))


    # link the log streams to stdout/err so they will be logged to the container logs
    subprocess.check_call(['ln', '-sf', '/dev/stdout', '/var/log/nginx/access.log'])
    subprocess.check_call(['ln', '-sf', '/dev/stderr', '/var/log/nginx/error.log'])

    nginx = subprocess.Popen(['nginx', '-c', '/opt/ml/code/nginx.conf'])
    gunicorn = subprocess.Popen(['gunicorn',
                                 '--timeout', str(model_server_timeout),
                                 '-b', 'unix:/tmp/gunicorn.sock',
                                 '-w', str(model_server_workers),
                                 'wsgi:app'])

    signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid))

    # If either subprocess exits, so do we.
    pids = set([nginx.pid, gunicorn.pid])
    while True:
        pid, _ = os.wait()
        if pid in pids:
            break

    sigterm_handler(nginx.pid, gunicorn.pid)
    print('Inference server exiting')


# The main routine just invokes the start function.
if __name__ == '__main__':
    start_server()

I deploy the model with:

predictor = classifier.deploy(1, 'ml.t2.medium', serializer=csv_serializer)

Please let me know what mistake I am making during deployment.

Answer 1

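Have you considered SageMaker script mode? It is much simpler than dealing with containers and low-level nginx details.
You only need to provide your Keras script: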

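With script mode, you can use training scripts similar to those you would use outside SageMaker with SageMaker's prebuilt containers for various deep learning frameworks such as TensorFlow, PyTorch, and Apache MXNet.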

https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-sentiment-script-mode/sentiment-analysis.ipynb

Answer 2

Make sure your container can respond to GET /ping requests: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-algo-ping-requests
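As a rough illustration of what answering the health check involves, here is a minimal sketch of a WSGI callable that gunicorn could load as wsgi:app. It uses only the standard library and is not the asker's predictor.py; the /invocations branch is a placeholder that echoes the payload size instead of calling a real model.

```python
import json


def app(environ, start_response):
    """Minimal WSGI app answering the two routes SageMaker hosting calls."""
    method = environ.get("REQUEST_METHOD", "GET")
    path = environ.get("PATH_INFO", "/")

    if method == "GET" and path == "/ping":
        # Health check: HTTP 200 with an empty body signals the container is up.
        status, body = "200 OK", b""
    elif method == "POST" and path == "/invocations":
        # Placeholder inference (assumption): report the payload size only.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = environ["wsgi.input"].read(length)
        status = "200 OK"
        body = json.dumps({"received_bytes": len(payload)}).encode("utf-8")
    else:
        status, body = "404 Not Found", b""

    headers = [("Content-Type", "application/json"),
               ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]
```

If /ping never returns 200 because the process behind it failed to start, the endpoint fails with the health-check error shown in the question.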

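Judging from the traceback, the server fails to start when the container is launched in SageMaker. I would look further down the stack trace to find out why the server startup fails.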

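You can also try running the container locally to debug any issues. SageMaker starts the container with the command "docker run <image> serve", so you can run the same command and debug the container. https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-run-image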
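Once the container is running locally, the health check can be reproduced by hand. The sketch below assumes Python 3 and that the container was started with port 8080 published on localhost (for example by adding -p 8080:8080 to the docker run command). The helper name wait_for_ping is made up for this illustration.

```python
import time
import urllib.error
import urllib.request


def wait_for_ping(url, attempts=10, delay=1.0):
    """Return True once `url` answers HTTP 200, False if every attempt fails."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not reachable yet, or it returned an error status; retry.
            pass
        time.sleep(delay)
    return False
```

Calling wait_for_ping("http://localhost:8080/ping") while watching the container's output should show whether nginx and gunicorn both came up, or which of them failed first.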

Answer 3

You have not installed gunicorn, which is why you get the error "/tmp/gunicorn.sock failed (2: No such file or directory)". You need to add pip install gunicorn and apt-get install nginx to your Dockerfile.
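One way to test this diagnosis inside the container is to check that both programs are on PATH before the serve script launches them. This is a sketch that assumes Python 3 (shutil.which is not available in the Python 2.7 shown in the traceback). The function names are made up for this illustration.

```python
import shutil


def missing_executables(names):
    """Return the entries of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def check_server_dependencies():
    # nginx and gunicorn are the two programs the serve script starts.
    missing = missing_executables(["nginx", "gunicorn"])
    if missing:
        raise RuntimeError("Not installed or not on PATH: " + ", ".join(missing))
```

Calling check_server_dependencies() at the top of start_server() would turn a missing binary into an explicit message in the CloudWatch logs rather than a bare subprocess exception.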

How do I resolve an error when deploying a model in AWS SageMaker?
