AWS SageMaker TensorFlow Serving - Endpoint failure - CloudWatch log ref: "NET_LOG: Entering the event loop ..."

Time: 2019-12-23 22:58:04

Tags: docker tensorflow nginx tensorflow-serving amazon-sagemaker

This is my first time using SageMaker to serve my own custom TensorFlow model, so I have been following these Medium articles to get started:

How to Create a TensorFlow Serving Container for AWS SageMaker
How to Push a Docker Image to AWS ECS Repository
How to Deploy an AWS SageMaker Container Using TensorFlow Serving
How to Make Predictions Against a SageMaker Endpoint Using TensorFlow Serving

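I managed to create my serving container, push it to ECR successfully, and create the SageMaker model from the Docker image. However, when I tried to create the endpoint, creation started but ended after 3-5 minutes with this failure message:

"The primary container for production variant Default did not pass the ping health check. Please check CloudWatch logs for this endpoint."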

Failure Image

I then checked my CloudWatch logs, which looked like this...

CloudWatch Logs

...ending with "NET_LOG: Entering the event loop ..."

I searched Google for more information about this log message in relation to deploying SageMaker models with TF Serving, but could not find any helpful solutions.
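One way to narrow this down is to reproduce the health check locally before deploying. SageMaker sends GET requests to /ping on port 8080 and expects an HTTP 200 in response. The sketch below assumes the image is tagged sagemaker-tf-serving locally and that curl is installed; `ping_status_ok` and `ping_status` are hypothetical helper names, not part of any AWS tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reproducing the SageMaker ping check locally.

# Succeeds only when the given HTTP status code is 200.
ping_status_ok() {
    [ "$1" = "200" ]
}

# Prints the HTTP status code returned by GET <base_url>/ping.
# curl prints "000" when the connection itself fails.
ping_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$1/ping"
}

# Usage (assumes the local image tag sagemaker-tf-serving):
#   docker run --rm -d -p 8080:8080 --name tfs-test sagemaker-tf-serving
#   ping_status_ok "$(ping_status http://localhost:8080)" && echo healthy
#   docker logs tfs-test
#   docker rm -f tfs-test
```

If the local check fails as well, the problem is likely inside the container rather than in the SageMaker configuration.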

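To give more context, before hitting this problem I ran into two other issues:

  1. "FileSystemStoragePathSource encountered a file-system access error: Could not find base path ‹MODEL_PATH›/‹MODEL_NAME›/ for ‹MODEL_NAME›"
  2. "No versions of servable found under base path"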

I managed to resolve both using the following links:

[Documentation] TensorFlowModel endpoints need the export/Servo folder structure, but this is not documented

Failed Reason: The primary container for production variant AllTraffic did not pass the ping health check.
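Both of those errors relate to the layout TF Serving expects: the base path must contain at least one numeric version directory, which in turn holds saved_model.pb. A minimal sketch of a pre-build check follows; `has_servable_version` is a hypothetical helper, and the example path is taken from the Dockerfile in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if <base_path> contains at least one
# numeric version directory that holds a saved_model.pb file.
has_servable_version() {
    local base_path="$1" dir
    for dir in "$base_path"/*/; do
        case "$(basename "$dir")" in
            ''|*[!0-9]*) continue ;;   # skip names that are not purely numeric
        esac
        [ -f "${dir}saved_model.pb" ] && return 0
    done
    return 1
}

# Usage (inside the container):
#   has_servable_version /opt/ml/model/export/Servo && echo "layout looks servable"
```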

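It is also worth noting that my TensorFlow model was created with TF version 2.0 (hence the need for the Docker container). I used only the AWS CLI to set up TensorFlow Serving, not the SageMaker SDK.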

Below are snippets of my shell scripts:

nginx.conf

events {
    # determines how many requests can simultaneously be served
    # https://www.digitalocean.com/community/tutorials/how-to-optimize-nginx-configuration
    # for more information
    worker_connections 2048;
}

http {
  server {
    # configures the server to listen to the port 8080
    # Amazon SageMaker sends inference requests to port 8080.
    # For more information: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-container-response
    listen 8080 deferred;

    # redirects requests from SageMaker to TF Serving
    location /invocations {
      proxy_pass http://localhost:8501/v1/models/pornilarity_model:predict;
    }

    # Used by SageMaker to confirm if server is alive.
    # https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-algo-ping-requests
    location /ping {
      return 200 "OK";
    }
  }
}

Dockerfile


# RUN pip install sagemaker-containers

# Installing NGINX, used to reverse proxy the predictions from SageMaker to TF Serving
RUN apt-get update && apt-get install -y --no-install-recommends nginx git

# Copy our model folder to the container 
# NB: Tensorflow serving requires you manually assign version numbering to models e.g. model_path/1/
# see below links: 

# https://stackoverflow.com/questions/45544928/tensorflow-serving-no-versions-of-servable-model-found-under-base-path
# https://github.com/aws/sagemaker-python-sdk/issues/599
COPY pornilarity_model /opt/ml/model/export/Servo/1/

# Copy NGINX configuration to the container
COPY nginx.conf /opt/ml/code/nginx.conf

# Copies the hosting code inside the container
# COPY serve.py /opt/ml/code/serve.py

# Defines serve.py as script entrypoint
# ENV SAGEMAKER_PROGRAM serve.py

# starts NGINX and TF serving pointing to our model
ENTRYPOINT service nginx start | tensorflow_model_server --rest_api_port=8501 \
 --model_name=pornilarity_model \
 --model_base_path=/opt/ml/model/export/Servo/

Build and push

%%sh

# The name of our algorithm
ecr_repo=sagemaker-tf-serving
docker_image=sagemaker-tf-serving

cd container

# chmod a+x container/serve.py

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to eu-west-2 if none defined)
region=$(aws configure get region)
region=${region:-eu-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${ecr_repo}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${ecr_repo}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${ecr_repo}" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${docker_image} .
# docker tag ${docker_image} ${fullname}
docker tag ${docker_image}:latest ${fullname}

docker push ${fullname}
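The script above relies on `aws ecr get-login`, which is not available in AWS CLI v2. As a hedged alternative for newer CLI releases, the sketch below uses `aws ecr get-login-password` piped into `docker login`. `ecr_image_uri` is a hypothetical helper that builds the same URI as `fullname` above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the full ECR image URI.
# The tag defaults to "latest" when the fourth argument is omitted.
ecr_image_uri() {
    local account="$1" region="$2" repo="$3" tag="${4:-latest}"
    echo "${account}.dkr.ecr.${region}.amazonaws.com/${repo}:${tag}"
}

# Usage with an AWS CLI release that provides `get-login-password`:
#   registry="${account}.dkr.ecr.${region}.amazonaws.com"
#   aws ecr get-login-password --region "${region}" |
#       docker login --username AWS --password-stdin "${registry}"
#   docker push "$(ecr_image_uri "${account}" "${region}" sagemaker-tf-serving)"
```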

Create the SageMaker model

#!/usr/bin/env bash

CONTAINER_NAME="Pornilarity-Container"
MODEL_NAME=pornilarity-model-v1

# the name of the role created with
# https://gist.github.com/mvsusp/599311cb9f4ee1091065f8206c026962
ROLE_NAME=AmazonSageMaker-ExecutionRole-20191202T133391

# the name of the image created with
# https://gist.github.com/mvsusp/07610f9cfecbec13fb2b7c77a2e843c4
ECS_IMAGE_NAME=sagemaker-tf-serving
# the ARN of the execution role
EXECUTION_ROLE_ARN=$(aws iam get-role --role-name ${ROLE_NAME} | jq -r .Role.Arn)

# the URI of the image repository (hosted in ECR, despite the ECS_ prefix)
ECS_IMAGE_URI=$(aws ecr describe-repositories --repository-names ${ECS_IMAGE_NAME} |\
jq -r '.repositories[0].repositoryUri')

# defines the SageMaker model primary container image as the ECR image
PRIMARY_CONTAINER="ContainerHostname=${CONTAINER_NAME},Image=${ECS_IMAGE_URI}"

# Creating the model
aws sagemaker create-model --model-name ${MODEL_NAME} \
--primary-container=${PRIMARY_CONTAINER}  --execution-role-arn ${EXECUTION_ROLE_ARN}

Endpoint configuration

#!/usr/bin/env bash

MODEL_NAME=pornilarity-model-v1

ENDPOINT_CONFIG_NAME=pornilarity-model-v1-config

ENDPOINT_NAME=pornilarity-v1-endpoint

PRODUCTION_VARIANTS="VariantName=Default,ModelName=${MODEL_NAME},"\
"InitialInstanceCount=1,InstanceType=ml.c5.large"

aws sagemaker create-endpoint-config --endpoint-config-name ${ENDPOINT_CONFIG_NAME} \
--production-variants ${PRODUCTION_VARIANTS}

aws sagemaker create-endpoint --endpoint-name ${ENDPOINT_NAME} \
--endpoint-config-name ${ENDPOINT_CONFIG_NAME}
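After `create-endpoint` returns, the endpoint status and any failure reason can be read from the CLI rather than the console. The sketch below is one possible way to do that; `endpoint_log_group` and `describe_endpoint_failure` are hypothetical helper names. It assumes the endpoint's container logs land in the /aws/sagemaker/Endpoints/‹endpoint name› log group.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting a failed endpoint.

# Prints the CloudWatch log group assumed to hold an endpoint's container logs.
endpoint_log_group() {
    echo "/aws/sagemaker/Endpoints/$1"
}

# Prints the endpoint status and, if present, the failure reason.
describe_endpoint_failure() {
    aws sagemaker describe-endpoint --endpoint-name "$1" \
        --query '[EndpointStatus, FailureReason]' --output text
}

# Usage:
#   describe_endpoint_failure pornilarity-v1-endpoint
#   aws logs filter-log-events \
#       --log-group-name "$(endpoint_log_group pornilarity-v1-endpoint)" \
#       --query 'events[].message' --output text
```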

Docker container folder structure

├── container
│   ├── Dockerfile
│   ├── nginx.conf
│   ├── pornilarity_model
│   │   ├── assets
│   │   ├── saved_model.pb
│   │   └── variables
│   │       ├── variables.data-00000-of-00002
│   │       ├── variables.data-00001-of-00002
│   │       └── variables.index

Any guidance would be much appreciated!

0 answers:

No answers