ValueError: Error hosting endpoint. The primary container for production variant AllTraffic did not pass the ping health check

Time: 2019-06-21 15:21:49

Tags: python scikit-learn amazon-sagemaker

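I am trying to deploy an SKLearn model on Amazon SageMaker by working through the example provided in its documentation, and I get the above error when deploying the model.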

I am following the instructions in this notebook and, so far, have copied and pasted the code it provides.

This is the exact code in my Jupyter notebook:

# S3 prefix
prefix = 'Scikit-iris'

import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()

import numpy as np
import os
from sklearn import datasets

# Load Iris dataset, then join labels and features
iris = datasets.load_iris()
joined_iris = np.insert(iris.data, 0, iris.target, axis=1)

# Create directory and write csv
os.makedirs('./iris', exist_ok=True)
np.savetxt('./iris/iris.csv', joined_iris, delimiter=',', fmt='%1.1f, %1.3f, %1.3f, %1.3f, %1.3f')

WORK_DIRECTORY = 'data'

train_input = sagemaker_session.upload_data(WORK_DIRECTORY, key_prefix="{}/{}".format(prefix, WORK_DIRECTORY) )

from sagemaker.sklearn.estimator import SKLearn

script_path = 'scikit_learn_iris.py'

sklearn = SKLearn(
  entry_point=script_path,
  train_instance_type="ml.c4.xlarge",
  role=role,
  sagemaker_session=sagemaker_session,
  framework_version='0.20.0',
  hyperparameters={'max_leaf_nodes': 30})

sklearn.fit({'train': train_input})

sklearn.deploy(instance_type='ml.m4.xlarge',
               initial_instance_count=1)

At that point I get the error message.

The contents of 'scikit_learn_iris.py' are as follows:

import argparse
import pandas as pd
import os
import numpy as np

from sklearn import tree
from sklearn.externals import joblib

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here. In this simple example we are just including one hyperparameter.
    parser.add_argument('--max_leaf_nodes', type=int, default=-1)

    # SageMaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

    args = parser.parse_args()

    # Take the set of files and read them all into a single pandas dataframe
    input_files = [ os.path.join(args.train, file) for file in os.listdir(args.train) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))
    raw_data = [ pd.read_csv(file, header=None, engine="python") for file in input_files ]
    train_data = pd.concat(raw_data)

    # labels are in the first column
    train_y = train_data.ix[:,0].astype(np.int)
    train_X = train_data.ix[:,1:]

    # We determine the number of leaf nodes using the hyper-parameter above.
    max_leaf_nodes = args.max_leaf_nodes

    # Now use scikit-learn's decision tree classifier to train the model.
    clf = tree.DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
    clf = clf.fit(train_X, train_y)

    # Save the decision tree model.
    joblib.dump(clf, os.path.join(args.model_dir, "model.joblib"))

My CloudWatch logs are as follows:

[Image: screenshot of the CloudWatch logs]

1 Answer:

Answer 0 (score: 1)

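Based on the error in the CloudWatch logs, your script is missing the model_fn definition that the notebook provides. For convenience, I have reproduced the function here: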

def model_fn(model_dir):
    return joblib.load(os.path.join(model_dir, "model.joblib"))

Try appending it to the bottom of the script, then re-run the notebook.
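To see why placement matters, here is a minimal local sketch. It assumes, as I understand the serving side to work, that the script is imported rather than run as `__main__`, so model_fn has to sit at module level, outside the `if __name__ == '__main__':` block. This is not the SageMaker container's actual code: `load_entry_point`, `check_model_fn`, and the embedded `SCRIPT` are hypothetical stand-ins, and pickle replaces joblib only so the sketch runs without scikit-learn installed.

```python
import importlib.util
import os
import pickle
import tempfile
import textwrap

# Hypothetical stand-in for scikit_learn_iris.py (pickle instead of joblib).
SCRIPT = textwrap.dedent('''
    import os
    import pickle

    if __name__ == "__main__":
        # Training code lives here and runs only when executed as a script.
        raise SystemExit("training would run here")

    def model_fn(model_dir):
        """Load the model artifact that training saved into model_dir."""
        with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
            return pickle.load(f)
''')


def load_entry_point(path):
    """Import a script by file path (assumed to mimic what serving does)."""
    spec = importlib.util.spec_from_file_location("entry_point", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def check_model_fn(script_source, model_obj):
    """Write the script and a fake model to a temp dir, then call model_fn."""
    with tempfile.TemporaryDirectory() as tmp:
        script_path = os.path.join(tmp, "entry_point.py")
        with open(script_path, "w") as f:
            f.write(script_source)
        with open(os.path.join(tmp, "model.pkl"), "wb") as f:
            pickle.dump(model_obj, f)
        module = load_entry_point(script_path)
        if not hasattr(module, "model_fn"):
            raise AttributeError("entry point defines no model_fn")
        return module.model_fn(tmp)


# A plain dict plays the role of the trained model in this sketch.
restored = check_model_fn(SCRIPT, {"max_leaf_nodes": 30})
print(restored)
```

If model_fn were indented under the `__main__` guard, or left out as in the question, the import would succeed but the lookup would fail, which would fit a container that never passes its health check.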