Local deployment of a TensorFlow model with Elastic Inference on a SageMaker notebook instance

Time: 2020-06-17 06:25:29

Tags: tensorflow amazon-sagemaker

I am trying to reproduce https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_serving_using_elastic_inference_with_your_own_model/tensorflow_serving_pretrained_model_elastic_inference.ipynb

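My Elastic Inference accelerator is attached to the notebook instance, and I am using the conda_amazonei_tensorflow_p36 kernel. Following the documentation, I made these changes to run EI locally: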

%%time
import boto3

region = boto3.Session().region_name
saved_model = 's3://sagemaker-sample-data-{}/tensorflow/model/resnet/resnet_50_v2_fp32_NCHW.tar.gz'.format(region)

import sagemaker
from sagemaker.tensorflow.serving import Model

role = sagemaker.get_execution_role()

tensorflow_model = Model(model_data=saved_model,
                         role=role,
                         framework_version='1.14')
tf_predictor = tensorflow_model.deploy(initial_instance_count=1,
                                       instance_type='local',
                                       accelerator_type='local_sagemaker_notebook')

I get the following logs in the notebook:

Attaching to tmp6uqys1el_algo-1-7ynb1_1
algo-1-7ynb1_1 | INFO:main:starting services
algo-1-7ynb1_1 | INFO:main:using default model name: Servo
algo-1-7ynb1_1 | INFO:main:tensorflow serving model config:
algo-1-7ynb1_1 | model_config_list: {
algo-1-7ynb1_1 | config: {
algo-1-7ynb1_1 | name: "Servo",
algo-1-7ynb1_1 | base_path: "/opt/ml/model/export/Servo",
algo-1-7ynb1_1 | model_platform: "tensorflow"
algo-1-7ynb1_1 | }
algo-1-7ynb1_1 | }
algo-1-7ynb1_1 |
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | INFO:main:nginx config:
algo-1-7ynb1_1 | load_module modules/ngx_http_js_module.so;
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | worker_processes auto;
algo-1-7ynb1_1 | daemon off;
algo-1-7ynb1_1 | pid /tmp/nginx.pid;
algo-1-7ynb1_1 | error_log /dev/stderr error;
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | worker_rlimit_nofile 4096;
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | events {
algo-1-7ynb1_1 | worker_connections 2048;
algo-1-7ynb1_1 | }
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | http {
algo-1-7ynb1_1 | include /etc/nginx/mime.types;
algo-1-7ynb1_1 | default_type application/json;
algo-1-7ynb1_1 | access_log /dev/stdout combined;
algo-1-7ynb1_1 | js_include tensorflow-serving.js;
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | upstream tfs_upstream {
algo-1-7ynb1_1 | server localhost:8501;
algo-1-7ynb1_1 | }
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | upstream gunicorn_upstream {
algo-1-7ynb1_1 | server unix:/tmp/gunicorn.sock fail_timeout=1;
algo-1-7ynb1_1 | }
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | server {
algo-1-7ynb1_1 | listen 8080 deferred;
algo-1-7ynb1_1 | client_max_body_size 0;
algo-1-7ynb1_1 | client_body_buffer_size 100m;
algo-1-7ynb1_1 | subrequest_output_buffer_size 100m;
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | set $tfs_version 1.14;
algo-1-7ynb1_1 | set $default_tfs_model Servo;
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | location /tfs {
algo-1-7ynb1_1 | rewrite ^/tfs/(.) /$1 break;
algo-1-7ynb1_1 | proxy_redirect off;
algo-1-7ynb1_1 | proxy_pass_request_headers off;
algo-1-7ynb1_1 | proxy_set_header Content-Type 'application/json';
algo-1-7ynb1_1 | proxy_set_header Accept 'application/json';
algo-1-7ynb1_1 | proxy_pass http://tfs_upstream;
algo-1-7ynb1_1 | }
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | location /ping {
algo-1-7ynb1_1 | js_content ping;
algo-1-7ynb1_1 | }
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | location /invocations {
algo-1-7ynb1_1 | js_content invocations;
algo-1-7ynb1_1 | }
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | location ~ ^/models/(.)/invoke {
algo-1-7ynb1_1 | js_content invocations;
algo-1-7ynb1_1 | }
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | location /models {
algo-1-7ynb1_1 | proxy_pass http://gunicorn_upstream/models;
algo-1-7ynb1_1 | }
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | location / {
algo-1-7ynb1_1 | return 404 '{"error": "Not Found"}';
algo-1-7ynb1_1 | }
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | keepalive_timeout 3;
algo-1-7ynb1_1 | }
algo-1-7ynb1_1 | }
algo-1-7ynb1_1 |
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | INFO:main:tensorflow version info:
algo-1-7ynb1_1 | TensorFlow ModelServer: 1.14.0-rc0+dev.sha.34d9e85
algo-1-7ynb1_1 | TensorFlow Library: 1.14.0
algo-1-7ynb1_1 | EI Version: EI-1.4
algo-1-7ynb1_1 | INFO:main:tensorflow serving command: tensorflow_model_server --port=9000 --rest_api_port=8501 --model_config_file=/sagemaker/model-config.cfg
algo-1-7ynb1_1 | INFO:main:started tensorflow serving (pid: 8)
algo-1-7ynb1_1 | INFO:main:nginx version info:
algo-1-7ynb1_1 | nginx version: nginx/1.16.1
algo-1-7ynb1_1 | built by gcc 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
algo-1-7ynb1_1 | built with OpenSSL 1.0.2g 1 Mar 2016
algo-1-7ynb1_1 | TLS SNI support enabled
algo-1-7ynb1_1 | configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -fPIE -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-Bsymbolic-functions -fPIE -pie -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'
algo-1-7ynb1_1 | INFO:main:started nginx (pid: 10)
algo-1-7ynb1_1 | 2020-06-17 05:02:08.888114: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
algo-1-7ynb1_1 | 2020-06-17 05:02:08.888186: I tensorflow_serving/model_servers/server_core.cc:561] (Re-)adding model: Servo
algo-1-7ynb1_1 | 2020-06-17 05:02:08.988623: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: Servo version: 1527887769}
algo-1-7ynb1_1 | 2020-06-17 05:02:08.988688: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: Servo version: 1527887769}
algo-1-7ynb1_1 | 2020-06-17 05:02:08.988728: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: Servo version: 1527887769}
algo-1-7ynb1_1 | 2020-06-17 05:02:08.988762: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:363] Attempting to load native SavedModelBundle in bundle-shim from: /opt/ml/model/export/Servo/1527887769
algo-1-7ynb1_1 | 2020-06-17 05:02:08.988783: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /opt/ml/model/export/Servo/1527887769
algo-1-7ynb1_1 | 2020-06-17 05:02:09.001922: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
algo-1-7ynb1_1 | 2020-06-17 05:02:09.082734: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle.
algo-1-7ynb1_1 | 2020-06-17 05:02:09.613725: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: /opt/ml/model/export/Servo/1527887769
algo-1-7ynb1_1 | Using Amazon Elastic Inference Client Library Version: 1.5.3
algo-1-7ynb1_1 | Number of Elastic Inference Accelerators Available: 1
algo-1-7ynb1_1 | Elastic Inference Accelerator ID: eia-813285f77ceb448c849e2331116f251b
algo-1-7ynb1_1 | Elastic Inference Accelerator Type: eia2.medium
algo-1-7ynb1_1 | Elastic Inference Accelerator Ordinal: 0
algo-1-7ynb1_1 |
algo-1-7ynb1_1 | 172.18.0.1 - - [17/Jun/2020:05:02:10 +0000] "GET /ping HTTP/1.1" 200 3 "-" "-"
algo-1-7ynb1_1 | [Wed Jun 17 05:02:11 2020, 662569us] [Execution Engine] Error getting application context for [TensorFlow][2]
algo-1-7ynb1_1 | [Wed Jun 17 05:02:11 2020, 662722us] [Execution Engine][TensorFlow][2] Failed - Last Error:
algo-1-7ynb1_1 | EI Error Code: [3, 16, 8]
algo-1-7ynb1_1 | EI Error Description: Unable to authenticate with accelerator
algo-1-7ynb1_1 | EI Request ID: TF-D66B9810-D81A-448F-ACE2-703FFFA0F194 -- EI Accelerator ID: eia-813285f77ceb448c849e2331116f251b
algo-1-7ynb1_1 | EI Client Version: 1.5.3
algo-1-7ynb1_1 | 2020-06-17 05:02:11.668412: F external/org_tensorflow/tensorflow/contrib/ei/session/eia_session.cc:1219] Non-OK-status: SwapExStateWithEI(tmp_inputs, tmp_outputs, tmp_freeze) status: Internal: Failed to get the initial operator whitelist from server.
algo-1-7ynb1_1 | WARNING:main:unexpected tensorflow serving exit (status: 6). restarting.
algo-1-7ynb1_1 | INFO:main:tensorflow version info:
algo-1-7ynb1_1 | TensorFlow ModelServer: 1.14.0-rc0+dev.sha.34d9e85
algo-1-7ynb1_1 | TensorFlow Library: 1.14.0
algo-1-7ynb1_1 | EI Version: EI-1.4
algo-1-7ynb1_1 | INFO:main:tensorflow serving command: tensorflow_model_server --port=9000 --rest_api_port=8501 --model_config_file=/sagemaker/model-config.cfg
algo-1-7ynb1_1 | INFO:main:started tensorflow serving (pid: 38)
algo-1-7ynb1_1 | 2020-06-17 05:02:11.759706: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
algo-1-7ynb1_1 | 2020-06-17 05:02:11.759783: I tensorflow_serving/model_servers/server_core.cc:561] (Re-)adding model: Servo
algo-1-7ynb1_1 | 2020-06-17 05:02:11.860242: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: Servo version: 1527887769}
algo-1-7ynb1_1 | 2020-06-17 05:02:11.860309: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: Servo version: 1527887769}
algo-1-7ynb1_1 | 2020-06-17 05:02:11.860333: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: Servo version: 1527887769}
algo-1-7ynb1_1 | 2020-06-17 05:02:11.860365: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:363] Attempting to load native SavedModelBundle in bundle-shim from: /opt/ml/model/export/Servo/1527887769
algo-1-7ynb1_1 | 2020-06-17 05:02:11.860382: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /opt/ml/model/export/Servo/1527887769
algo-1-7ynb1_1 | 2020-06-17 05:02:11.873381: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
algo-1-7ynb1_1 | 2020-06-17 05:02:11.949421: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle.
algo-1-7ynb1_1 | 2020-06-17 05:02:12.512935: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: /opt/ml/model/export/Servo/1527887769
algo-1-7ynb1_1 | Using Amazon Elastic Inference Client Library Version: 1.5.3
algo-1-7ynb1_1 | Number of Elastic Inference Accelerators Available: 1
algo-1-7ynb1_1 | Elastic Inference Accelerator ID: eia-813285f77ceb448c849e2331116f251b
algo-1-7ynb1_1 | Elastic Inference Accelerator Type: eia2.medium
algo-1-7ynb1_1 | Elastic Inference Accelerator Ordinal: 0

The logs never stop in the notebook; the cell keeps printing output. I am not sure whether the model has been deployed correctly.
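
The repeating output above looks like a crash loop: TensorFlow Serving exits after the Elastic Inference error and the container restarts it. As a minimal sketch, assuming the container log has been captured as a list of text lines, the helper below pulls out the EI error descriptions and the exit statuses. The function name `summarize_container_log` and both regular expressions are illustrative, written against the log lines quoted in this post rather than any documented log format.

```python
import re

# Patterns written against the container log lines quoted in this post.
EI_ERROR = re.compile(r"EI Error Description:\s*(.+)")
RESTART = re.compile(r"unexpected tensorflow serving exit \(status: (\d+)\)")


def summarize_container_log(lines):
    """Collect EI error descriptions and TensorFlow Serving exit statuses."""
    errors, exit_statuses = [], []
    for line in lines:
        match = EI_ERROR.search(line)
        if match:
            errors.append(match.group(1).strip())
        match = RESTART.search(line)
        if match:
            exit_statuses.append(int(match.group(1)))
    return {"ei_errors": errors, "exit_statuses": exit_statuses}


# Three lines copied from the log above.
sample = [
    "algo-1-7ynb1_1 | EI Error Code: [3, 16, 8]",
    "algo-1-7ynb1_1 | EI Error Description: Unable to authenticate with accelerator",
    "algo-1-7ynb1_1 | WARNING:main:unexpected tensorflow serving exit (status: 6). restarting.",
]
summary = summarize_container_log(sample)
print(summary)
```

If `exit_statuses` keeps growing while `ei_errors` repeats the same message, the endpoint is restarting rather than serving, which would match the behaviour described here.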

I can see that the Docker container for the model is running.

When I try to run inference/prediction against this model, I get an error:

algo-1-iikpj_1 | [Wed Jun 17 05:29:47 2020, 761607us] [Execution Engine] Error getting application context for [TensorFlow][2]

algo-1-iikpj_1 | [Wed Jun 17 05:29:47 2020, 761691us] [Execution Engine][TensorFlow][2] Failed - Last Error:
algo-1-iikpj_1 | EI Error Code: [3, 16, 8]
algo-1-iikpj_1 | EI Error Description: Unable to authenticate with accelerator
algo-1-iikpj_1 | EI Request ID: TF-ADECD8EF-7138-4B5F-9C37-ADFDC8122DF1 -- EI Accelerator ID: eia-813285f77ceb448c849e2331116f251b
algo-1-iikpj_1 | EI Client Version: 1.5.3
algo-1-iikpj_1 | 2020-06-17 05:29:47.768249: F external/org_tensorflow/tensorflow/contrib/ei/session/eia_session.cc:1219] Non-OK-status: SwapExStateWithEI(tmp_inputs, tmp_outputs, tmp_freeze) status: Internal: Failed to get the initial operator whitelist from server.
algo-1-iikpj_1 | WARNING:main:unexpected tensorflow serving exit (status: 6). restarting.
algo-1-iikpj_1 | INFO:main:tensorflow version info:
algo-1-iikpj_1 | TensorFlow ModelServer: 1.14.0-rc0+dev.sha.34d9e85
algo-1-iikpj_1 | TensorFlow Library: 1.14.0
algo-1-iikpj_1 | EI Version: EI-1.4
algo-1-iikpj_1 | INFO:main:tensorflow serving command: tensorflow_model_server --port=9000 --rest_api_port=8501 --model_config_file=/sagemaker/model-config.cfg
algo-1-iikpj_1 | INFO:main:started tensorflow serving (pid: 1052)
algo-1-iikpj_1 | 2020-06-17 05:29:47.854331: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
algo-1-iikpj_1 | 2020-06-17 05:29:47.854405: I tensorflow_serving/model_servers/server_core.cc:561] (Re-)adding model: Servo
algo-1-iikpj_1 | 2020/06/17 05:29:47 [error] 11#11: *2 connect() failed (111: Connection refused) while connecting to upstream, client: 172.18.0.1, server: , request: "POST /invocations HTTP/1.1", subrequest: "/v1/models/Servo:predict", upstream: "http://127.0.0.1:8501/v1/models/Servo:predict", host: "localhost:8080"
algo-1-iikpj_1 | 2020/06/17 05:29:47 [error] 11#11: *2 connect() failed (111: Connection refused) while connecting to upstream, client: 172.18.0.1, server: , request: "POST /invocations HTTP/1.1", subrequest: "/v1/models/Servo:predict", upstream: "http://127.0.0.1:8501/v1/models/Servo:predict", host: "localhost:8080"
algo-1-iikpj_1 | 172.18.0.1 - - [17/Jun/2020:05:29:47 +0000] "POST /invocations HTTP/1.1" 502 157 "-" "-"
algo-1-iikpj_1 | 2020-06-17 05:29:47.954825: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: Servo version: 1527887769}
algo-1-iikpj_1 | 2020-06-17 05:29:47.954887: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: Servo version: 1527887769}
algo-1-iikpj_1 | 2020-06-17 05:29:47.955448: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: Servo version: 1527887769}
algo-1-iikpj_1 | 2020-06-17 05:29:47.955494: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:363] Attempting to load native SavedModelBundle in bundle-shim from: /opt/ml/model/export/Servo/1527887769
algo-1-iikpj_1 | 2020-06-17 05:29:47.955859: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /opt/ml/model/export/Servo/1527887769
algo-1-iikpj_1 | 2020-06-17 05:29:47.969511: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
JSONDecodeError Traceback (most recent call last)
in ()

~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/site-packages/sagemaker/tensorflow/serving.py in predict(self, data, initial_args)
116 args["CustomAttributes"] = self._model_attributes
117
--> 118 return super(Predictor, self).predict(data, args)
119
120

~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/site-packages/sagemaker/predictor.py in predict(self, data, initial_args, target_model)
109 request_args = self._create_request_args(data, initial_args, target_model)
110 response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
--> 111 return self._handle_response(response)
112
113 def _handle_response(self, response):

~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/site-packages/sagemaker/predictor.py in _handle_response(self, response)
119 if self.deserializer is not None:
120 # It's the deserializer's responsibility to close the stream
--> 121 return self.deserializer(response_body, response["ContentType"])
122 data = response_body.read()
123 response_body.close()

~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/site-packages/sagemaker/predictor.py in __call__(self, stream, content_type)
578 """
579 try:
--> 580 return json.load(codecs.getreader("utf-8")(stream))
581 finally:
582 stream.close()

~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
297 cls=cls, object_hook=object_hook,
298 parse_float=parse_float, parse_int=parse_int,
--> 299 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
300
301

~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
352 parse_int is None and parse_float is None and
353 parse_constant is None and object_pairs_hook is None and not kw):
--> 354 return _default_decoder.decode(s)
355 if cls is None:
356 cls = JSONDecoder

~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/json/decoder.py in decode(self, s, _w)
337
338 """
--> 339 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
340 end = _w(s, end).end()
341 if end != len(s):

~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
355 obj, end = self.scan_once(s, idx)
356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
358 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

algo-1-iikpj_1 | 2020-06-17 05:29:48.047106: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle.
algo-1-iikpj_1 | 2020-06-17 05:29:48.564452: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: /opt/ml/model/export/Servo/1527887769
algo-1-iikpj_1 | Using Amazon Elastic Inference Client Library Version: 1.5.3
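
The traceback above ends inside the SDK's JSON deserializer, and the container log shows the same request returning HTTP 502. A plausible reading is that the 502 body was not JSON, so the decoder failed on its first character. The sketch below reproduces that failure in isolation. The HTML string is an assumed stand-in, since the actual response body is not shown in this post, and `parse_prediction` is a hypothetical helper, not part of the SageMaker SDK.

```python
import json

# Assumed stand-in for the 502 body; the real response is not shown in the post.
bad_gateway_body = "<html><head><title>502 Bad Gateway</title></head></html>"


def parse_prediction(body):
    """Return (parsed, None) for a JSON body, or (None, message) otherwise."""
    try:
        return json.loads(body), None
    except json.JSONDecodeError as err:
        return None, str(err)


result, error = parse_prediction(bad_gateway_body)
print(error)  # Expecting value: line 1 column 1 (char 0)
```

Seen this way, the JSONDecodeError is a symptom of the failed request rather than a problem with how the payload is serialized on the client.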

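I tried several ways to resolve "JSONDecodeError: Expecting value: line 1 column 1 (char 0)", using json.loads, json.dumps, and so on, but nothing helped. I also tried sending a REST API POST request to the model deployed in Docker: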

curl -v -X POST \
  -H 'content-type:application/json' \
  -d '{"data": {"inputs": [[[[0.13075708159043742, 0.048010725848070535, 0.9012465727287071], [0.1643217202482622, 0.7392467524276859, 0.5618572640643519], [0.7697097217983989, 0.9829998452540657, 0.08567413146192027]]]]} }' \
  http://127.0.0.1:8080/v1/models/Servo:predict
but I still get an error.
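
Judging by the nginx config printed in the log, port 8080 only routes /ping, /invocations, /models and /tfs, and everything else falls through to the 404 handler, so /v1/models/Servo:predict on that port would probably not reach TensorFlow Serving. The TensorFlow Serving REST predict API also expects "inputs" or "instances" at the top level of the body, not nested under "data". The sketch below builds, but does not send, a request to /invocations with that shape. The helper name `build_invocation_request` is hypothetical and the tensor values are rounded placeholders.

```python
import json
import urllib.request


def build_invocation_request(inputs, host="127.0.0.1", port=8080):
    """Build (but do not send) a POST to the container's /invocations route."""
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/invocations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder tensor with the same nesting as the curl payload (values rounded).
inputs = [[[[0.13, 0.05, 0.90], [0.16, 0.74, 0.56], [0.77, 0.98, 0.09]]]]
request = build_invocation_request(inputs)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```

This would not fix the failure on its own while TensorFlow Serving is crash-looping on the Elastic Inference error, but it removes the request path and payload shape as additional variables.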

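Please help me resolve this. Initially I was trying to use my own TensorFlow Serving model and got the same error. I then switched to the same model used in the AWS example notebook (resnet_50_v2_fp32_NCHW.tar.gz), so the experiment above uses the AWS example notebook with the model provided in sagemaker-sample-data.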

Please help. Thanks.

1 answer:

Answer 0 (score: 0)

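Solved. The error was caused by the role/Elastic Inference permissions attached to the notebook. Once our DevOps team fixed those permissions, it worked as expected. See https://github.com/aws/sagemaker-tensorflow-serving-container/issues/142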
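
The answer does not say which permission was missing. As a sketch only, the snippet below prints an IAM policy document containing the two actions that the AWS Elastic Inference setup documentation lists for connecting to an accelerator. Whether these were the ones missing in this case is an assumption, and the variable name `ei_policy` is illustrative.

```python
import json

# Assumption: these are the actions AWS documents for Elastic Inference access;
# the answer does not state which permission was actually missing.
ei_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["elastic-inference:Connect", "ec2:DescribeVpcEndpoints"],
            "Resource": "*",
        }
    ],
}

print(json.dumps(ei_policy, indent=2))
```

The printed JSON can be compared against the policies attached to the notebook instance's execution role to check whether the "Unable to authenticate with accelerator" error might have the same cause.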