I am following this article https://machinelearningmastery.com/how-to-perform-object-detection-with-yolov3-in-keras/ in order to deploy YOLOv3 in AWS SageMaker. I have model.weights holding the weights, model.json holding the model structure, and model.h5 holding the model structure plus weights. When I convert these files to protobuf format so that they can be compressed and deployed to SageMaker, I get this error.
UnexpectedStatusException: Error hosting endpoint sagemaker-tensorflow-2020-04-12-10-57-05-
567: Failed. Reason: The primary container for production variant AllTraffic did not pass the
ping health check. Please check CloudWatch logs for this endpoint..
Here is my code:
import tensorflow
tensorflow.__version__
Output:
'1.7.0'
import boto3, re
from sagemaker import get_execution_role
role = get_execution_role()
from tensorflow.keras.models import model_from_json
!ls keras_model/
import tensorflow as tf
# Rebuild the architecture from the JSON description, then load the weights
json_file = open('/home/ec2-user/SageMaker/keras_model/' + 'model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json,
                               custom_objects={"GlorotUniform": tf.keras.initializers.glorot_uniform})
loaded_model.load_weights('/home/ec2-user/SageMaker/keras_model/model_weights.h5')
print("Loaded model from disk")
from tensorflow.python.saved_model import builder
from tensorflow.python.saved_model.signature_def_utils import predict_signature_def
from tensorflow.python.saved_model import tag_constants
# This directory structure must be followed as below. Do not change it.
model_version = '1'
export_dir = 'export/Servo/' + model_version
# Build the protocol buffer SavedModel at export_dir
build = builder.SavedModelBuilder(export_dir)
print(loaded_model.inputs)
print([t for t in loaded_model.outputs])
Output:
[<tf.Tensor 'input_1:0' shape=(?, ?, ?, 3) dtype=float32>]
[<tf.Tensor 'conv_81/BiasAdd:0' shape=(?, ?, ?, 255) dtype=float32>, <tf.Tensor 'conv_93/BiasAdd:0' shape=(?, ?, ?, 255) dtype=float32>, <tf.Tensor 'conv_105/BiasAdd:0' shape=(?, ?, ?, 255) dtype=float32>]
tf.convert_to_tensor(loaded_model.output)
Output:
<tf.Tensor 'packed:0' shape=(3, ?, ?, ?, 255) dtype=float32>
signature = predict_signature_def(inputs={"input_image": loaded_model.input},
                                  outputs={t.name: t for t in loaded_model.outputs})
from tensorflow.keras import backend as K
with K.get_session() as sess:
    build.add_meta_graph_and_variables(sess=sess, tags=[tag_constants.SERVING],
                                       signature_def_map={"serving_default": signature})
    build.save()
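Before packaging, the exported signature can be sanity-checked with the saved_model_cli tool that ships with TensorFlow (a quick verification sketch; assumes the CLI is on the notebook's PATH):
!saved_model_cli show --dir export/Servo/1 --all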
!ls export/Servo/1/variables/
Output:
variables.data-00000-of-00001 variables.index
import tarfile
with tarfile.open('model.tar.gz', mode='w:gz') as archive:
    archive.add('export', recursive=True)
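The legacy SageMaker TensorFlow serving container expects the archive to contain the export/Servo/<version>/ layout built above, so listing the archive is a cheap way to confirm the structure survived (assumes model.tar.gz is in the current directory):
!tar -tzf model.tar.gz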
import sagemaker
sagemaker_session = sagemaker.Session()
inputs = sagemaker_session.upload_data(path='model.tar.gz', key_prefix='model')
!touch train.py  # create an empty entry-point script; it only needs to exist for this container
from sagemaker.tensorflow.model import TensorFlowModel
sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                                  role=role,
                                  entry_point='train.py')
%%time
predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.t2.large')
Error:
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
<timed exec> in <module>()
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, update_endpoint, tags, kms_key, wait, data_capture_config)
478 kms_key=kms_key,
479 wait=wait,
--> 480 data_capture_config_dict=data_capture_config_dict,
481 )
482
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait, data_capture_config_dict)
2849
2850 self.sagemaker_client.create_endpoint_config(**config_options)
-> 2851 return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
2852
2853 def expand_role(self, role):
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
2381 )
2382 if wait:
-> 2383 self.wait_for_endpoint(endpoint_name)
2384 return endpoint_name
2385
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
2638 ),
2639 allowed_statuses=["InService"],
-> 2640 actual_status=status,
2641 )
2642 return desc
UnexpectedStatusException: Error hosting endpoint sagemaker-tensorflow-2020-04-12-10-57-05-567: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
I think the error is caused by the difference between the tensor shapes of loaded_model.inputs and loaded_model.outputs, but I am still not sure what the 3 and the 255 in these shapes represent (see my guess after the shapes below). Any help would be greatly appreciated.
print(loaded_model.inputs)
[<tf.Tensor 'input_1:0' shape=(?, ?, ?, 3) dtype=float32>]
print([t for t in loaded_model.outputs])
[<tf.Tensor 'conv_81/BiasAdd:0' shape=(?, ?, ?, 255) dtype=float32>,
<tf.Tensor 'conv_93/BiasAdd:0' shape=(?, ?, ?, 255) dtype=float32>,
<tf.Tensor 'conv_105/BiasAdd:0' shape=(?, ?, ?, 255) dtype=float32>]
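My best guess at the shapes: the 3 in the input is just the RGB channel count, and the 255 in each output would match the standard YOLOv3 head trained on COCO (an assumption based on the article, not something read out of the model itself):
num_anchors_per_cell = 3              # anchor boxes predicted at each grid cell
num_classes = 80                      # COCO classes
box_params = 4 + 1                    # 4 box coordinates + 1 objectness score
print(num_anchors_per_cell * (box_params + num_classes))   # 3 * 85 = 255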
CloudWatch logs:
2020-04-12 11:01:33,439 INFO - root - running container entrypoint
2020-04-12 11:01:33,440 INFO - root - starting serve task
2020-04-12 11:01:33,440 INFO - container_support.serving - reading config
Downloading s3://sagemaker-us-east-1-611475884433/sagemaker-tensorflow-2020-04-12-10-57-05-375/sourcedir.tar.gz to /tmp/script.tar.gz
2020-04-12 11:01:33,828 INFO - container_support.serving - importing user module
2020-04-12 11:01:33,828 INFO - container_support.serving - loading framework-specific dependencies
2020-04-12 11:01:35,795 INFO - container_support.serving - starting nginx
2020-04-12 11:01:35,797 INFO - container_support.serving - nginx config:
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log /var/log/nginx/error.log;
worker_rlimit_nofile 4096;
events {
worker_connections 2048;
}
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
access_log /var/log/nginx/access.log combined;
upstream gunicorn {
server unix:/tmp/gunicorn.sock;
}
server {
listen 8080 deferred;
client_max_body_size 0;
keepalive_timeout 3;
location ~ ^/(ping|invocations) {
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header Host $http_host;
proxy_redirect off;
proxy_pass http://gunicorn;
}
location / {
return 404 "{}";
}
}
}
2020-04-12 11:01:35,815 INFO - container_support.serving - starting gunicorn
2020-04-12 11:01:35,820 INFO - container_support.serving - inference server started. waiting on processes: set([24, 23])
2020-04-12 11:01:35.904746: I tensorflow_serving/model_servers/server.cc:82] Building single TensorFlow model file config: model_name: generic_model model_base_path: /opt/ml/model/export/Servo
2020-04-12 11:01:35.905995: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2020-04-12 11:01:35.906148: I tensorflow_serving/model_servers/server_core.cc:517] (Re-)adding model: generic_model
2020-04-12 11:01:35.907173: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: generic_model version: 1}
2020-04-12 11:01:35.907349: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: generic_model version: 1}
2020-04-12 11:01:35.907422: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: generic_model version: 1}
2020-04-12 11:01:35.907578: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /opt/ml/model/export/Servo/1
2020-04-12 11:01:35.907687: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /opt/ml/model/export/Servo/1
2020-04-12 11:01:35.939232: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2020-04-12 11:01:35.980215: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[2020-04-12 11:01:36 +0000] [24] [INFO] Starting gunicorn 19.9.0
2020-04-12 11:01:36.048327: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:259] SavedModel load for tags { serve }; Status: fail. Took 140502 microseconds.
2020-04-12 11:01:36.048617: E tensorflow_serving/util/retrier.cc:37] Loading servable: {name: generic_model version: 1} failed: Not found: Op type not registered 'FusedBatchNormV3' in binary running on model.aws.local. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
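The very last log line also looks relevant: the TensorFlow Serving binary inside the container does not recognize the 'FusedBatchNormV3' op, which (if I understand it correctly) would mean the SavedModel was produced by a newer TensorFlow than the one serving it. If that is the cause, pinning the container to the exporting TensorFlow version might help (a sketch; framework_version is a TensorFlowModel parameter in the SageMaker Python SDK, but the exact value to use here is an assumption):
sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                                  role=role,
                                  entry_point='train.py',
                                  framework_version='1.7')  # assumption: match the TF version that exported the SavedModel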