How can I share a GPU between two job steps in SLURM? I can share CPUs between two steps, but not GPUs.
```
srun --pty --gpus=1 bash
compute-node-11:~$ nvidia-smi
Sun Feb 16 22:42:47 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   36C    P0    44W / 160W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
srun --pty --jobid=1164 bash
srun: Job 1164 step creation temporarily disabled, retrying
```
Even when I tried `--oversubscribe`, I got the same problem. Is it impossible to share a GPU between two job steps?