Out of disk space

Time: 2019-09-11 20:14:26

Tags: azure-machine-learning-service

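When running an AML pipeline on AML compute, I get this kind of error (shown below).

I can try restarting the cluster, but that may not fix the problem (if nothing is persisted on the nodes, a restart should clear them).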

Session ID: 933fc468-7a22-425d-aa1b-94eba5784faa
{"error":{"code":"ServiceError","message":"Job preparation failed: [Errno 28] No space left on device","detailsUri":null,"target":null,"details":[],"innerError":null,"debugInfo":{"type":"OSError","message":"[Errno 28] No space left on device","stackTrace":" File \"/mnt/batch/tasks/shared/LS_root/jobs/jj2/azureml/piperun-20190911_1568231788841835_1/mounts/workspacefilestore/azureml/PipeRun-20190911_1568231788841835_1-setup/job_prep.py\", line 126, in <module>\n invoke()\n File \"/mnt/batch/tasks/shared/LS_root/jobs/jj2/azureml/piperun-20190911_1568231788841835_1/mounts/workspacefilestore/azureml/PipeRun-20190911_1568231788841835_1-setup/job_prep.py\", line 97, in invoke\n extract_project(project_dir, options.project_zip, options.snapshots)\n File \"/mnt/batch/tasks/shared/LS_root/jobs/jj2/azureml/piperun-20190911_1568231788841835_1/mounts/workspacefilestore/azureml/PipeRun-20190911_1568231788841835_1-setup/job_prep.py\", line 60, in extract_project\n project_fetcher.fetch_project_snapshot(snapshot[\"Id\"], snapshot[\"PathStack\"])\n File \"/mnt/batch/tasks/shared/LS_root/jobs/jj2/azureml/piperun-20190911_1568231788841835_1/mounts/workspacefilestore/azureml/PipeRun-20190911_1568231788841835_1/azureml-setup/project_fetcher.py\", line 72, in fetch_project_snapshot\n _download_tree(sas_tree, path_stack)\n File \"/mnt/batch/tasks/shared/LS_root/jobs/jj2/azureml/piperun-20190911_1568231788841835_1/mounts/workspacefilestore/azureml/PipeRun-20190911_1568231788841835_1/azureml-setup/project_fetcher.py\", line 106, in _download_tree\n _download_tree(child, path_stack)\n File \"/mnt/batch/tasks/shared/LS_root/jobs/jj2/azureml/piperun-20190911_1568231788841835_1/mounts/workspacefilestore/azureml/PipeRun-20190911_1568231788841835_1/azureml-setup/project_fetcher.py\", line 106, in _download_tree\n _download_tree(child, path_stack)\n File 
\"/mnt/batch/tasks/shared/LS_root/jobs/jj2/azureml/piperun-20190911_1568231788841835_1/mounts/workspacefilestore/azureml/PipeRun-20190911_1568231788841835_1/azureml-setup/project_fetcher.py\", line 98, in _download_tree\n fh.write(response.read())\n","innerException":null,"data":null,"errorResponse":null}},"correlation":null,"environment":null,"location":null,"time":"0001-01-01T00:00:00+00:00"}

I expected the job to run normally. I have actually checked the node, and it has plenty of free disk space:

root@4f57957ac829466a86bad4d4dc51fadd000001:~# df -kh
Filesystem      Size  Used Avail Use% Mounted on
udev             28G     0   28G   0% /dev
tmpfs           5.6G  9.0M  5.5G   1% /run
/dev/sda1       125G  2.8G  122G   3% /
tmpfs            28G     0   28G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            28G     0   28G   0% /sys/fs/cgroup
/dev/sdb1       335G  6.7G  311G   3% /mnt
tmpfs           5.6G     0  5.6G   0% /run/user/1002

Any suggestions on what I should check?
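One thing worth ruling out: `df -kh` reports free bytes only, while `[Errno 28]` can also come from inode exhaustion or from a quota on a mounted share. The traceback fails while writing under a `mounts/workspacefilestore` directory, which does not appear as its own entry in the `df` output above. The sketch below is a minimal check of both bytes and inodes for a given path. The helper name `space_report` and the list of paths are illustrative assumptions, not part of any AML tooling.

```python
import os
import shutil


def space_report(path):
    """Return free bytes and free inodes for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)   # total/used/free in bytes
    stats = os.statvfs(path)          # Unix only; exposes inode counters
    return {
        "path": path,
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        # Some filesystems (network shares in particular) report 0 here,
        # which means "not reported", not "exhausted".
        "total_inodes": stats.f_files,
        "free_inodes": stats.f_favail,
    }


if __name__ == "__main__":
    # Illustrative paths: "/" and "/mnt" come from the df output, the third
    # is the job root prefix seen in the traceback. Adjust for your node.
    candidates = ["/", "/mnt", "/mnt/batch/tasks/shared/LS_root/jobs/jj2"]
    for candidate in candidates:
        if os.path.exists(candidate):
            print(space_report(candidate))
```

If the byte and inode counts both look healthy on the local disks, the limit is more likely on the mounted share than on the node itself.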

2 answers:

Answer 0: (score: 1)

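It looks like you are hitting the limits of the Azure file share. You can use the sample code at the link below to switch the run to Blob storage, which scales to a large number of jobs running in parallel: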

https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data#accessing-source-code-during-training
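For convenience, here is a rough sketch of what that change looks like, as far as I recall the linked page: the run configuration carries a setting that names the datastore used for transferring source code. Treat the attribute name `source_directory_data_store` and the datastore name `workspaceblobstore` as assumptions to verify against your installed SDK version. The wrapper function `use_blob_for_source` is hypothetical and exists only to keep the example self-contained.

```python
def use_blob_for_source(run_config, datastore_name="workspaceblobstore"):
    """Point source-code transfer for a run at a blob datastore.

    `run_config` is expected to be an azureml RunConfiguration-like object.
    The attribute name below is an assumption based on the linked docs.
    """
    run_config.source_directory_data_store = datastore_name
    return run_config


# Usage with the SDK (requires azureml-core; not executed here):
# from azureml.core.runconfig import RunConfiguration
# run_config = use_blob_for_source(RunConfiguration())
# ...then pass run_config to the pipeline step as usual.
```

With this in place the project snapshot should no longer be fetched through the file share mount that appears in the traceback, which is where the write failed.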

Answer 1: (score: 0)

We are also working on a feature to clean up the disk before or after a job runs. There is no ETA yet.