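I am trying to create a custom model/image/container for Amazon SageMaker. I have read all the basic tutorials on how to build an image to your own requirements. In fact, I have a properly set-up image that runs TensorFlow and trains, deploys, and serves the model locally.
The problem comes when I try to run the container with the SageMaker Python SDK. More precisely, it comes when I try to use the Framework module and class to create my own custom estimator that runs the custom image/container.
Here is the minimal code that explains my case:
File structure: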
.
├── Dockerfile
├── variables.env
├── requirements.txt
├── test_sagemaker.ipynb
├── src
|   ├── train
|   ├── serve
|   ├── predict.py
|   └── custom_code/my_model_functions
|
└── local_test
    ├── train_local.sh
    ├── serve_local.sh
    ├── predict.sh
    └── test_dir
        ├── model/model.pkl
        ├── output/output.txt
        └── input
            ├── data/data.pkl
            └── config
                ├── hyperparameters.json
                ├── inputdataconfig.json
                └── resourceconfig.json
Dockerfile:
FROM ubuntu:16.04
MAINTAINER Amazon AI <sage-learner@amazon.com>
# Install python and other runtime dependencies
RUN apt-get update && \
    apt-get -y install build-essential libatlas-dev git wget curl nginx jq && \
    apt-get -y install python3-dev python3-setuptools
# Install pip
RUN cd /tmp && \
    curl -O https://bootstrap.pypa.io/get-pip.py && \
    python3 get-pip.py && \
    rm get-pip.py
# Installing Requirements
COPY requirements.txt /requirements.txt
RUN pip3 install -r /requirements.txt
# Set SageMaker training environment variables
ENV SM_ENV_VARIABLES env_variables
COPY local_test/test_dir /opt/ml
# Set up the program in the image
COPY src /opt/program
WORKDIR /opt/program
train
from __future__ import absolute_import

import json, sys, logging, os, subprocess, time, traceback
from pprint import pprint

# Custom Code Functions
from custom_code.custom_estimator import CustomEstimator
from custom_code.custom_dataset import create_dataset

# Important SageMaker Modules
import sagemaker_containers.beta.framework as framework
from sagemaker_containers import _env

logger = logging.getLogger(__name__)


def run_algorithm_mode():
    """Run training in algorithm mode, which does not require a user entry point."""
    train_config = os.environ.get('training_env_variables')
    model_path = os.environ.get("model_path")
    print("Downloading Dataset")
    train_dataset, test_dataset = create_dataset(None)
    print("Creating Model")
    clf = CustomEstimator.create_model(train_config)
    print("Starting Training")
    clf = clf.train_model(train_dataset, test_dataset)
    print("Saving Model")
    module_name = 'classifier.pkl'
    CustomEstimator.save_model(clf, model_path)


def train(training_environment):
    """Run Custom Model training in either 'algorithm mode' or using a user supplied module in local SageMaker environment.

    The user supplied module and its dependencies are downloaded from S3.
    Training is invoked by calling a "train" function in the user supplied module.

    Args:
        training_environment: training environment object containing environment variables,
            training arguments and hyperparameters
    """
    if training_environment.user_entry_point is not None:
        print("Entry Point Receive")
        framework.modules.run_module(training_environment.module_dir,
                                     training_environment.to_cmd_args(),
                                     training_environment.to_env_vars(),
                                     training_environment.module_name,
                                     capture_error=False)
        # print_directories() is not defined in this snippet
        print_directories()
    else:
        logger.info("Running Custom Model Sagemaker in 'algorithm mode'")
        try:
            _env.write_env_vars(training_environment.to_env_vars())
        except Exception as error:
            print(error)
        run_algorithm_mode()


def main():
    train(framework.training_env())
    sys.exit(0)


if __name__ == '__main__':
    main()
test_sagemaker.ipynb
I created this custom SageMaker estimator from the Framework class in the SageMaker estimator module.
import boto3
from sagemaker.estimator import Framework
from sagemaker.tensorflow import TensorFlow  # needed for TensorFlow.create_model below


class ScriptModeTensorFlow(Framework):
    """This class is temporary until the final version of Script Mode is released.
    """

    __framework_name__ = "tensorflow-scriptmode"

    create_model = TensorFlow.create_model

    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version="py3",
        image_name=None,
        **kwargs
    ):
        super(ScriptModeTensorFlow, self).__init__(
            entry_point, source_dir, hyperparameters, image_name=image_name, **kwargs
        )
        self.py_version = py_version
        self.image_name = None
        self.framework_version = '2.0.0'
        self.user_entry_point = entry_point
        print(self.user_entry_point)
Then I create the estimator, passing the entry_point and the image (plus all the other parameters the class needs to run).
estimator = ScriptModeTensorFlow(entry_point='training_script_path/train_model.py',
                                 image_name='sagemaker-custom-image:latest',
                                 source_dir='source_dir_path/input/config',
                                 train_instance_type='local',  # Run in local mode
                                 train_instance_count=1,
                                 hyperparameters=hyperparameters,
                                 py_version='py3',
                                 role=role)
Finally, I run the training...
estimator.fit({"train": "s3://s3-bucket-path/training_data"})
But I get the following error:
Creating tmpm3ft7ijm_algo-1-mjqkd_1 ...
Attaching to tmpm3ft7ijm_algo-1-mjqkd_12mdone
algo-1-mjqkd_1 | Reporting training FAILURE
algo-1-mjqkd_1 | framework error:
algo-1-mjqkd_1 | Traceback (most recent call last):
algo-1-mjqkd_1 | File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_trainer.py", line 65, in train
algo-1-mjqkd_1 | env = sagemaker_containers.training_env()
algo-1-mjqkd_1 | File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/__init__.py", line 27, in training_env
algo-1-mjqkd_1 | resource_config=_env.read_resource_config(),
algo-1-mjqkd_1 | File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 240, in read_resource_config
algo-1-mjqkd_1 | return _read_json(resource_config_file_dir)
algo-1-mjqkd_1 | File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 192, in _read_json
algo-1-mjqkd_1 | with open(path, "r") as f:
algo-1-mjqkd_1 | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-mjqkd_1 |
algo-1-mjqkd_1 | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-mjqkd_1 | Traceback (most recent call last):
algo-1-mjqkd_1 | File "/usr/local/bin/dockerd-entrypoint.py", line 24, in <module>
algo-1-mjqkd_1 | subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
algo-1-mjqkd_1 | File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
algo-1-mjqkd_1 | raise CalledProcessError(retcode, cmd)
algo-1-mjqkd_1 | subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 2.
tmpm3ft7ijm_algo-1-mjqkd_1 exited with code 1
Aborting on container exit...
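At first glance the error seems obvious: the file '/opt/ml/input/config/resourceconfig.json' is missing. The thing is, I have no way of creating this file so that the SageMaker framework can get the hosts for multiprocessing (which I don't need yet). When I build the image 'sagemaker-custom-image:latest' following the folder structure below, I already place 'resourceconfig.json' in the '/opt/ml/input/config/' folder inside the image.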
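To see what the container actually finds at runtime, a small diagnostic can be run at the top of the train script, before framework.training_env() is called. This is only a sketch: missing_config_files and EXPECTED_CONFIG_FILES are hypothetical names, and the default path is the one that appears in the traceback.

```python
import os

# File names taken from the folder layout in the question.
EXPECTED_CONFIG_FILES = (
    "hyperparameters.json",
    "inputdataconfig.json",
    "resourceconfig.json",
)


def missing_config_files(config_dir="/opt/ml/input/config"):
    """Return the expected config files that are absent from config_dir."""
    try:
        present = set(os.listdir(config_dir))
    except FileNotFoundError:
        # The whole directory is missing, so every expected file is missing too.
        return list(EXPECTED_CONFIG_FILES)
    # Exact, case-sensitive comparison against the directory listing.
    return [name for name in EXPECTED_CONFIG_FILES if name not in present]
```

Calling print(missing_config_files()) inside the container lists whatever is absent. The comparison is case-sensitive, so a file named resourceConfig.json would still be reported as a missing resourceconfig.json.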
/opt/ml
├── input
│ ├── config
│ │ ├── hyperparameters.json
│ │ ├── inputdataconfig.json
│ │ └── resourceConfig.json
│ └── data
│ └── <channel_name>
│ └── <input data>
├── model
│ └── <model files>
└── output
└── failure
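Reading the AWS documentation, it says that when you run your image with the SageMaker SDK, the files in the container's '/opt/ml' folder may no longer be visible during training.
/opt/ml and all subdirectories are reserved by Amazon SageMaker training. When building your algorithm's Docker image, please ensure you don't place any data required by your algorithm under them, as the data may no longer be visible during training. (How Amazon SageMaker Runs Your Training Image)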
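If the file really is absent at runtime, one possible workaround is to have the train script write a minimal single-host resourceconfig.json before the framework reads it. The sketch below is not how the SDK itself provides the file. It assumes that the config directory is writable inside the container and that a single-host config is enough. The name ensure_resource_config is hypothetical.

```python
import json
import os
import socket


def ensure_resource_config(config_dir="/opt/ml/input/config", host=None):
    """Write a single-host resourceconfig.json if none exists; return its path."""
    path = os.path.join(config_dir, "resourceconfig.json")
    if os.path.exists(path):
        return path  # Never overwrite a file that is already there.
    host = host or socket.gethostname()
    os.makedirs(config_dir, exist_ok=True)
    with open(path, "w") as f:
        # current_host and hosts are the keys SageMaker documents for this file.
        json.dump({"current_host": host, "hosts": [host]}, f)
    return path
```

The call would have to come before framework.training_env() in main(), since that is the call that fails in the traceback.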
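This basically sums up my problem.
Yes, I know I can use SageMaker's prebuilt estimators and images.
Yes, I know I can bypass the Framework class and run the image training from docker run.
But I need to implement a fully customized SageMaker SDK/image/container/model to use with an entry point. I know it is a bit ambitious.
To restate my question: how can I get the SageMaker framework or SDK to create the required resourceconfig.json file inside the image?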
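Answer 0 (score: 1)
Apparently, running the image remotely solves the problem. I am using a remote AWS machine, 'ml.m5.large'. Somewhere in the SageMaker SDK code, the files the image needs are created and passed in, but only when running on a remote machine, not locally.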
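In terms of the code in the question, the change amounts to swapping the train_instance_type value. A minimal sketch, assuming the same keyword names used in the question's estimator call; estimator_kwargs is a hypothetical helper, not part of the SDK:

```python
def estimator_kwargs(run_locally, role, image_name, remote_type="ml.m5.large"):
    """Build the keyword arguments that differ between local and remote runs."""
    return {
        "image_name": image_name,
        "role": role,
        "train_instance_count": 1,
        # 'local' runs the container on this machine; anything else is a remote instance type.
        "train_instance_type": "local" if run_locally else remote_type,
    }
```

It could then be used as ScriptModeTensorFlow(entry_point=..., **estimator_kwargs(False, role, image_name)). Note that for a remote job the image typically has to be pushed to Amazon ECR first, so image_name would be the ECR URI rather than a local tag.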