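I have a model that was trained on Machine Learning Compute in Azure Machine Learning Service. The registered model already exists in my workspace, and I want to deploy it to an AKS instance that I previously provisioned in that workspace. I am able to configure and register the container image successfully: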
from azureml.core.model import Model

# Retrieve cloud representations of the models (ws is the existing Workspace object)
rf = Model(workspace=ws, name='pumps_rf')
le = Model(workspace=ws, name='pumps_le')
ohc = Model(workspace=ws, name='pumps_ohc')
print(rf); print(le); print(ohc)
<azureml.core.model.Model object at 0x7f66ab3b1f98>
<azureml.core.model.Model object at 0x7f66ab7e49b0>
<azureml.core.model.Model object at 0x7f66ab85e710>
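import os
from azureml.core.conda_dependencies import CondaDependencies

package_list = [
    'category-encoders==1.3.0',
    'numpy==1.15.0',
    'pandas==0.24.1',
    'scikit-learn==0.20.2']
# Conda environment configuration
myenv = CondaDependencies.create(pip_packages=package_list)
# Write the file to a plain local path; open() does not understand a 'file:' prefix
conda_yml = os.path.join(os.getcwd(), 'myenv.yml')
with open(conda_yml, "w") as f:
    f.write(myenv.serialize_to_string())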
Configuring and registering the image works:
from azureml.core.image import ContainerImage

# Image configuration
image_config = ContainerImage.image_configuration(execution_script='score.py',
                                                  runtime='python',
                                                  conda_file='myenv.yml',
                                                  description='Pumps Random Forest model')
# Register the image from the image configuration
# to Azure Container Registry
image = ContainerImage.create(name=Config.IMAGE_NAME,
                              models=[rf, le, ohc],
                              image_config=image_config,
                              workspace=ws)
Creating image
Running....................
Succeeded
Image creation operation finished for image pumpsrfimage:2, operation "Succeeded"
Attaching to the existing cluster also works:
from azureml.core.compute import AksCompute, ComputeTarget

# Attach the cluster to your workspace
attach_config = AksCompute.attach_configuration(resource_group=Config.RESOURCE_GROUP,
                                                cluster_name=Config.DEPLOY_COMPUTE)
aks_target = ComputeTarget.attach(workspace=ws,
                                  name=Config.DEPLOY_COMPUTE,
                                  attach_configuration=attach_config)
# Wait for the operation to complete
aks_target.wait_for_completion(True)
Succeeded
Provisioning operation finished, operation "Succeeded"
However, when I try to deploy the image to the existing cluster, it fails with a WebserviceException.
from azureml.core.webservice import AksWebservice, Webservice

# Set configuration and service name
aks_config = AksWebservice.deploy_configuration()
# Deploy from image
service = Webservice.deploy_from_image(workspace=ws,
                                       name='pumps-aks-service-1',
                                       image=image,
                                       deployment_config=aks_config,
                                       deployment_target=aks_target)
# Wait for the deployment to complete
service.wait_for_deployment(show_output=True)
print(service.state)
WebserviceException: Unable to create service with image pumpsrfimage:1 in non "Succeeded" creation state.
---------------------------------------------------------------------------
WebserviceException Traceback (most recent call last)
<command-201219424688503> in <module>()
7 image = image,
8 deployment_config = aks_config,
----> 9 deployment_target = aks_target)
10 # Wait for the deployment to complete
11 service.wait_for_deployment(show_output = True)
/databricks/python/lib/python3.5/site-packages/azureml/core/webservice/webservice.py in deploy_from_image(workspace, name, image, deployment_config, deployment_target)
284 return child._deploy(workspace, name, image, deployment_config, deployment_target)
285
--> 286 return deployment_config._webservice_type._deploy(workspace, name, image, deployment_config, deployment_target)
287
288 @staticmethod
/databricks/python/lib/python3.5/site-packages/azureml/core/webservice/aks.py in _deploy(workspace, name, image, deployment_config, deployment_target)
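Any ideas on how to resolve this? I am writing the code in a Databricks notebook. Also, I can create and deploy the cluster through the Azure Portal without any problem, so this appears to be an issue with my code, the Python SDK, or the way Databricks works with AMLS.
Update: I was able to deploy the image to AKS through the Azure Portal, and the web service works as expected. This means the problem lies somewhere between Databricks, the azureml Python SDK, and the Machine Learning Service.
Update 2: I am working with Microsoft to resolve this issue. I will report back once there is a solution.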
Answer 0 (score: 2)
In my initial code, when creating the image, I did not call:
image.wait_for_creation(show_output=True)
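As a result, I was calling CreateImage and DeployImage before the image had finished being created, which caused the error. I can't believe it was that simple.
Updated image creation snippet: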
# Register the image from the image configuration
# to Azure Container Registry
image = ContainerImage.create(name=Config.IMAGE_NAME,
                              models=[rf, le, ohc],
                              image_config=image_config,
                              workspace=ws)
# Block until the image build has finished before deploying
image.wait_for_creation(show_output=True)
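To illustrate why the wait matters, here is a rough sketch of the polling idea: keep checking a state until it reports success, and only then move on to deployment. The `wait_for_state` helper, its parameters, and the fake state sequence are hypothetical and are not part of the Azure ML SDK; in real code, `image.wait_for_creation(show_output=True)` does the waiting for you.

```python
import time


def wait_for_state(get_state, target="Succeeded", failed=("Failed",),
                   timeout_s=600.0, poll_s=5.0, sleep=time.sleep):
    """Poll get_state() until it returns `target`.

    Raises RuntimeError if a failure state is seen and TimeoutError if the
    target is not reached after roughly `timeout_s` seconds of waiting.
    """
    waited = 0.0
    while True:
        state = get_state()
        if state == target:
            return state
        if state in failed:
            raise RuntimeError("operation ended in state %r" % state)
        if waited >= timeout_s:
            raise TimeoutError("still in state %r after %.0fs" % (state, waited))
        sleep(poll_s)
        waited += poll_s


# Stand-in for an image build that needs a few polls to finish
# (the state names here are illustrative only).
states = iter(["Creating", "Running", "Succeeded"])
final_state = wait_for_state(lambda: next(states), poll_s=0.0,
                             sleep=lambda seconds: None)
print(final_state)
```

Deploying only after such a check returns avoids handing the service an image that is still being built.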
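Answer 1 (score: 1)
From personal experience, I would say the error message you are seeing may indicate an error in the script inside the image. Such an error does not necessarily prevent the image from being created successfully, but it can prevent the image from being used in a service. However, if you have already deployed the image successfully to another service, you should be able to rule this out.
You can follow this guide for more information on how to debug the Docker image locally and how to find logs and other useful information.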
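One cheap way to rule out script errors before building an image is to call the scoring script's entry points directly. Azure ML scoring scripts define `init()` and `run(raw_data)`; the sketch below follows that convention, but the asker's actual score.py is not shown in the question, so the stand-in model and the `"data"` payload key are assumptions made purely for illustration.

```python
import json

model = None  # populated by init(), as in a typical scoring script


def init():
    """Load the model once at start-up (a stand-in replaces the real model here)."""
    global model
    # Hypothetical stand-in: "predicts" the sum of each input row.
    model = lambda rows: [sum(row) for row in rows]


def run(raw_data):
    """Score one JSON request; return errors as data instead of raising."""
    try:
        rows = json.loads(raw_data)["data"]  # assumed payload shape
        return {"result": model(rows)}
    except Exception as exc:
        return {"error": str(exc)}


# Local smoke test: one valid request and one malformed request.
init()
ok = run(json.dumps({"data": [[1, 2], [3, 4]]}))
bad = run("not json")
print(ok)
print(bad)
```

If a check like this fails locally, the same failure would most likely surface only at deployment time inside the container.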
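Answer 2 (score: 0)
Agree with Arvid's answer. Were you able to get it running successfully? You could also try deploying it to ACI; if the problem is in score.py you will hit the same issue, but it is quick to try. Also, debugging a deployment is somewhat painful, but you can expose TCP port 5678 in a local Docker deployment and attach with VS Code and PTVSD to step through the code.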
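As a rough sketch of the PTVSD approach, the scoring script could open the debug port only when explicitly asked to. The `SCORE_DEBUG` environment variable and the `maybe_enable_debugger` helper are hypothetical names chosen for this example; `ptvsd.enable_attach` and `ptvsd.wait_for_attach` are the ptvsd calls, and the package would need to be installed in the image for the debug path to work.

```python
import os


def maybe_enable_debugger(port=5678):
    """Start a ptvsd listener only when SCORE_DEBUG=1 (hypothetical switch).

    Returns True if a debugger attached, False if debugging was not requested.
    """
    if os.environ.get("SCORE_DEBUG") != "1":
        return False
    import ptvsd  # imported lazily so normal runs do not need the package
    # Listen on all interfaces so the port published by Docker is reachable.
    ptvsd.enable_attach(address=("0.0.0.0", port))
    # Block until VS Code attaches, so breakpoints in init()/run() are hit.
    ptvsd.wait_for_attach()
    return True


attached = maybe_enable_debugger()
print(attached)
```

Calling this at the top of `init()` in a local container started with port 5678 published would let VS Code attach before any model-loading code runs.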