I am trying to run the AutoML digit classification example on my Batch AI cluster. This cell seems to run fine:
from azureml.core.compute import BatchAiCompute
from azureml.core.compute import ComputeTarget
# Choose a name for your cluster.
batchai_cluster_name = "nc12-cluster-04"
found = False
# Check if this compute target already exists in the workspace.
for ct_name, ct in ws.compute_targets().items():
    if (ct.name == batchai_cluster_name and ct.type == 'BatchAI'):
        found = True
        print('Found existing compute target: {0}'.format(ct.name))
        compute_target = ct
        break
if not found:
    print('Creating a new compute target...')
    provisioning_config = BatchAiCompute.provisioning_configuration(vm_size = "STANDARD_NC12", # for GPU, use "STANDARD_NC6"
                                                                    #vm_priority = 'lowpriority', # optional
                                                                    autoscale_enabled = True,
                                                                    cluster_min_nodes = 0,
                                                                    cluster_max_nodes = 4)
    # Create the cluster.
    compute_target = ComputeTarget.create(ws, batchai_cluster_name, provisioning_config)
    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min_node_count is provided, it will use the scale settings for the cluster.
    compute_target.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)
# For a more detailed view of current Batch AI cluster status, use the 'status' property.
I get the expected output:
Found existing compute target: nc12-cluster-04
But when I try to submit the job like this:
from azureml.core.experiment import Experiment
experiment = Experiment(ws, experiment_name)
remote_run = experiment.submit(automl_config, show_output = False)
I get this error:
~/anaconda3_501/lib/python3.6/site-packages/azureml/_restclient/operations/jasmine_operations.py in post_remote_snapshot_run(self, subscription_id, resource_group_name, workspace_name, project_name, parent_run_id, json_definition, snapshot_id, custom_headers, raw, **operation_config)
237 if response.status_code not in [200]:
--> 238 raise HttpOperationError(self._deserialize, response)
239
HttpOperationError: Operation returned an invalid status code "ErrorMessage: 'BatchAI cluster nc12-cluster-04f7543302 does not exist'. Possible cause: Remote dsvm"
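Note that the cluster name in the error now carries a suffix: nc12-cluster-04f7543302 (the extra part is f7543302). That suffix is not present in the compute_target I passed: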
vars(compute_target)
{
...
'name': 'nc12-cluster-04',
'provisioning_errors': None,
'provisioning_state': 'Succeeded',
...
}
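To make the mismatch concrete, here is a minimal sketch that pulls the cluster name out of the error text and compares it with the name stored on compute_target. The helper extract_cluster_name and its regular expression are my own, inferred from this single error message; they are not part of the Azure ML SDK, and the strings are copied by hand from the output above.

```python
import re

# Error text copied from the traceback above.
error_message = (
    "ErrorMessage: 'BatchAI cluster nc12-cluster-04f7543302 does not exist'. "
    "Possible cause: Remote dsvm"
)

# Name as shown by vars(compute_target)['name'].
expected_name = "nc12-cluster-04"


def extract_cluster_name(message):
    """Return the cluster name mentioned in the error text, or None.

    Hypothetical helper: the pattern is guessed from this one message only.
    """
    match = re.search(r"BatchAI cluster (\S+) does not exist", message)
    return match.group(1) if match else None


reported_name = extract_cluster_name(error_message)
suffix = None

# Only compute the suffix if the reported name really starts with the expected one.
if reported_name is not None and reported_name.startswith(expected_name):
    suffix = reported_name[len(expected_name):]

print("Expected name:", expected_name)
print("Reported name:", reported_name)
print("Extra suffix: ", suffix)
```

So the service appears to be looking up a cluster whose name is my cluster name plus eight extra characters, and I do not know where those characters come from.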
Any ideas?