SageMaker model deployment fails

Time: 2020-06-22 20:32:03

Tags: amazon-sagemaker

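I am trying to deploy a SageMaker SKLearnModel to an endpoint, but I keep getting the error below. I can't work out which environment variables are passed to the SageMaker deployment instance so that I can debug it, nor why code_dir cannot be downloaded from S3, when the same source_dir works perfectly for training.

I appreciate that this is hard to reproduce, since it is most likely a SageMaker environment issue, but so far I have tried the following:

  • Pointing directly at the S3 source_dir location
  • Removing all non-essential code from the home directory
  • Deploying locally on SageMaker, which gives a similarly vague error
  • This endpoint pipeline worked before and was able to deploy similar trained models

Thoughts on the root cause:

  • Some environment variable is being passed through to _env
  • Connecting to S3 could be the problem, although that would also cause training to fail

Any help is much appreciated. I hope this is a common problem, so that an answer can help others in future.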
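One way to narrow down the S3 hypothesis is to check, from the notebook, whether a given object exists at all, since a 404 on HeadObject means the key was not found rather than that the connection failed. The following is a minimal sketch, not part of the original code: `split_s3_uri` and `s3_object_exists` are hypothetical helper names, and the client is assumed to be a boto3 S3 client whose errors carry a `response` dict (as botocore's `ClientError` does).

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % uri)
    return parsed.netloc, parsed.path.lstrip("/")


def s3_object_exists(s3_client, uri):
    """Return True if HeadObject succeeds, False if it reports 'not found'.

    Any other error (for example 403 access denied) is re-raised so that a
    permissions problem is not mistaken for a missing object.
    """
    bucket, key = split_s3_uri(uri)
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
    except Exception as exc:
        # botocore's ClientError exposes the parsed error in `exc.response`.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code in ("404", "NoSuchKey", "NotFound"):
            return False
        raise
    return True


# Usage sketch (needs AWS credentials; the URI is the one from the question):
#   import boto3
#   s3_object_exists(boto3.client("s3"),
#                    "s3://bucket/training_job/output/model.tar.gz")
```

Running the check against both the model artifact and the uploaded source archive would show which of the two objects the 404 refers to.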

Code:

from sagemaker import get_execution_role
from sagemaker.sklearn.model import SKLearnModel

source_dir = '/home/ec2-user/SageMaker/home_directory'
role = get_execution_role()
trained_model_location = 's3://bucket/training_job/output/model.tar.gz'
hosted_live_main = SKLearnModel(
    entry_point='generate.py',
    role=role,
    model_data=trained_model_location,
    source_dir=source_dir,
    image='DOCKER_IMAGE_NAME',
    env={
        'BUCKET_NAME': bucket_name,  # assumed to be defined earlier in the notebook
        'AWS_SECRET_ACCESS_KEY': 'XXXXX',
        'AWS_ACCESS_KEY_ID': 'XXXXX',
        'AWS_REGION_NAME': 'us-east-1',
    },
)
end = hosted_live_main.deploy(
    1, 'ml.t2.medium', endpoint_name='endpoint_name')

Error received:

botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
Traceback (most recent call last):
  File "/miniconda3/bin/serve", line 8, in <module>
    sys.exit(main())
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/cli/serve.py", line 19, in main
    server.start(env.ServingEnv().framework_module)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_server.py", line 86, in start
    _modules.import_module(env.module_dir, env.module_name)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_modules.py", line 253, in import_module
    _files.download_and_extract(uri, _env.code_dir)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_files.py", line 129, in download_and_extract
    s3_download(uri, dst)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_files.py", line 165, in s3_download
    s3.Bucket(bucket).download_file(key, dst)
  File "/miniconda3/lib/python3.7/site-packages/boto3/s3/inject.py", line 246, in bucket_download_file
    ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
  File "/miniconda3/lib/python3.7/site-packages/boto3/s3/inject.py", line 172, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/miniconda3/lib/python3.7/site-packages/boto3/s3/transfer.py", line 307, in download_file
    future.result()
  File "/miniconda3/lib/python3.7/site-packages/s3transfer/futures.py", line 106, in result
    return self._coordinator.result()
  File "/miniconda3/lib/python3.7/site-packages/s3transfer/futures.py", line 265, in result
    raise self._exception
  File "/miniconda3/lib/python3.7/site-packages/s3transfer/tasks.py", line 255, in _main
    self._submit(transfer_future=transfer_future, **kwargs)
  File "/miniconda3/lib/python3.7/site-packages/s3transfer/download.py", line 343, in _submit
    **transfer_future.meta.call_args.extra_args
  File "/miniconda3/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/miniconda3/lib/python3.7/site-packages/botocore/client.py", line 626, in _make_api_call
    raise error_class(parsed_response, operation_name)
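The traceback fails while downloading `env.module_dir`. To my understanding, the SageMaker Python SDK passes that location to the container through the `SAGEMAKER_SUBMIT_DIRECTORY` environment variable on the model. The sketch below, which is not part of the original code, pulls that value out of a DescribeModel response so it can be compared with what is actually in S3; `submit_directory` is a hypothetical helper name and `"my-model-name"` is a placeholder.

```python
def submit_directory(describe_model_response):
    """Return SAGEMAKER_SUBMIT_DIRECTORY from a DescribeModel response, or None."""
    containers = []
    primary = describe_model_response.get("PrimaryContainer")
    if primary:
        containers.append(primary)
    # Multi-container (pipeline) models list their containers here instead.
    containers.extend(describe_model_response.get("Containers", []))
    for container in containers:
        value = container.get("Environment", {}).get("SAGEMAKER_SUBMIT_DIRECTORY")
        if value:
            return value
    return None


# Usage sketch (needs AWS credentials; the model name is a placeholder):
#   import boto3
#   response = boto3.client("sagemaker").describe_model(ModelName="my-model-name")
#   print(submit_directory(response))
```

The same response also contains the full `Environment` mapping, which answers the question of which variables the hosting container receives.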

0 answers:

No answers