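Error when submitting an AutoML job to a Batch AI cluster

Time: 2018-11-12 15:45:44

Tags: automl azure-notebooks

I am trying to run the AutoML digit classification example on my Batch AI cluster. This cell appears to run fine: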

from azureml.core.compute import BatchAiCompute
from azureml.core.compute import ComputeTarget

# Choose a name for your cluster.
batchai_cluster_name = "nc12-cluster-04"

found = False
# Check if this compute target already exists in the workspace.
for ct_name, ct in ws.compute_targets().items():
    if (ct.name == batchai_cluster_name and ct.type == 'BatchAI'):
        found = True
        print('Found existing compute target: {0}'.format(ct.name))
        compute_target = ct
        break

if not found:
    print('Creating a new compute target...')
    provisioning_config = BatchAiCompute.provisioning_configuration(vm_size = "STANDARD_NC12", # for GPU, use "STANDARD_NC6"
                                                                #vm_priority = 'lowpriority', # optional
                                                                autoscale_enabled = True,
                                                                cluster_min_nodes = 0, 
                                                                cluster_max_nodes = 4)

    # Create the cluster.
    compute_target = ComputeTarget.create(ws, batchai_cluster_name, provisioning_config)

    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min_node_count is provided, it will use the scale settings for the cluster.
    compute_target.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)

    # For a more detailed view of current Batch AI cluster status, use the 'status' property.

I get the expected output:

Found existing compute target: nc12-cluster-04

But when I try to submit the job like this:

from azureml.core.experiment import Experiment

experiment = Experiment(ws, experiment_name)
remote_run = experiment.submit(automl_config, show_output = False)

I get this error:

~/anaconda3_501/lib/python3.6/site-packages/azureml/_restclient/operations/jasmine_operations.py in post_remote_snapshot_run(self, subscription_id, resource_group_name, workspace_name, project_name, parent_run_id, json_definition, snapshot_id, custom_headers, raw, **operation_config)
    237         if response.status_code not in [200]:
--> 238             raise HttpOperationError(self._deserialize, response)
    239 

HttpOperationError: Operation returned an invalid status code "ErrorMessage: 'BatchAI cluster nc12-cluster-04f7543302 does not exist'. Possible cause: Remote dsvm"

Note that the cluster name in the error message now carries a suffix, nc12-cluster-04f7543302, and that suffix is not present in the compute_target I passed:

vars(compute_target)

{
...
 'name': 'nc12-cluster-04',
 'provisioning_errors': None,
 'provisioning_state': 'Succeeded',
 ...
}
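To make the mismatch explicit, here is a minimal sketch that pulls the cluster name out of the error text and compares it with the name reported by vars(compute_target). The helper names extract_cluster_name and extra_suffix are hypothetical, written only for this illustration; they are not part of the azureml SDK, and the sketch only diagnoses the difference, it does not explain where the suffix comes from.

```python
import re

# Error text copied from the traceback above.
error_message = (
    "ErrorMessage: 'BatchAI cluster nc12-cluster-04f7543302 does not exist'. "
    "Possible cause: Remote dsvm"
)
# Name reported by vars(compute_target).
expected_name = "nc12-cluster-04"


def extract_cluster_name(message):
    """Hypothetical helper: return the cluster name from a
    'BatchAI cluster <name> does not exist' message, or None."""
    match = re.search(r"BatchAI cluster (\S+) does not exist", message)
    return match.group(1) if match else None


def extra_suffix(reported, expected):
    """Hypothetical helper: return the characters appended to `expected`,
    or None if `reported` does not start with `expected`."""
    if reported is None or not reported.startswith(expected):
        return None
    return reported[len(expected):]


reported_name = extract_cluster_name(error_message)
print(reported_name)                               # nc12-cluster-04f7543302
print(extra_suffix(reported_name, expected_name))  # f7543302
```

So the service is looking for a cluster whose name is my cluster name plus the eight characters f7543302.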

Any ideas?

0 answers:

No answers yet