Sometimes, when I run single-GPU TensorFlow code on a multi-GPU machine, the code executes on one GPU but allocates its memory on another. For obvious reasons, this causes a significant slowdown.
For example, see the nvidia-smi output below. Here, my colleague is using GPUs 0 and 1 (processes 32918 and 33112), and I start TensorFlow with the following commands (before importing tensorflow):
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
where gpu_id is 2, 3, and 4 respectively for my three processes. As the output below shows, memory is allocated correctly on GPUs 2, 3, and 4, but the code executes somewhere else entirely: in this case, on GPUs 0, 1, and 7.
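For reference, here is a minimal sketch of the full per-process setup (the device listing at the end is my illustration here, not part of my original script):

import os

gpu_id = 2  # 2, 3, or 4, depending on the process

# Both variables must be set before tensorflow is imported.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)

import tensorflow as tf  # the rest of the script builds the graph with tf as usual
from tensorflow.python.client import device_lib

# With a single visible device, this should list exactly one GPU, which
# TensorFlow names "/gpu:0" regardless of its physical ID.
print(device_lib.list_local_devices())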
Wed May 17 17:04:01 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48 Driver Version: 367.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:04:00.0 Off | 0 |
| N/A 41C P0 75W / 149W | 278MiB / 11439MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:05:00.0 Off | 0 |
| N/A 36C P0 89W / 149W | 278MiB / 11439MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:08:00.0 Off | 0 |
| N/A 61C P0 58W / 149W | 6265MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:09:00.0 Off | 0 |
| N/A 42C P0 70W / 149W | 8313MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:84:00.0 Off | 0 |
| N/A 51C P0 55W / 149W | 8311MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:85:00.0 Off | 0 |
| N/A 29C P0 68W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:88:00.0 Off | 0 |
| N/A 31C P0 54W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:89:00.0 Off | 0 |
| N/A 27C P0 68W / 149W | 0MiB / 11439MiB | 33% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 32918 C python 274MiB |
| 1 33112 C python 274MiB |
| 2 34891 C ...sadl/anaconda3/envs/tensorflow/bin/python 6259MiB |
| 3 34989 C ...sadl/anaconda3/envs/tensorflow/bin/python 8309MiB |
| 4 35075 C ...sadl/anaconda3/envs/tensorflow/bin/python 8307MiB |
+-----------------------------------------------------------------------------+
It seems that TensorFlow is, for some reason, partially ignoring the CUDA_VISIBLE_DEVICES option.
I am not using any device placement commands in my code.
This is with TensorFlow 1.1 on Ubuntu 16.04, and it has happened to me in a range of different scenarios.
Is there a known scenario in which this can happen? If so, is there anything I can do about it?
Answer 0 (score: 1)
One possible cause is nvidia-smi itself.
The nvidia-smi ordering is not the same as the CUDA GPU ID ordering.
"Users who want consistency are recommended to use the UUID or the PCI bus ID, since device enumeration ordering is not guaranteed to be consistent."
"FASTEST_FIRST causes CUDA to guess which device is fastest using a simple heuristic, and make that device 0, leaving the order of the rest of the devices unspecified. PCI_BUS_ID orders devices by PCI bus ID in ascending order."
Have a look here: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
It is also discussed here: Inconsistency of IDs between 'nvidia-smi -L' and cuDeviceGetName()
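As a quick sanity check (my own sketch, not from the linked docs), you can dump nvidia-smi's index-to-PCI-bus mapping and compare it with the order CUDA enumerates. This assumes your driver's nvidia-smi supports the --query-gpu fields below (check nvidia-smi --help-query-gpu):

import subprocess

# Print each nvidia-smi index next to the PCI bus ID it refers to. With
# CUDA_DEVICE_ORDER=PCI_BUS_ID, CUDA enumerates in the same ascending
# bus-ID order, so the two sets of indices should then line up.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"]
).decode()
for line in out.strip().splitlines():
    index, bus_id = (field.strip() for field in line.split(","))
    print("nvidia-smi index %s -> PCI bus id %s" % (index, bus_id))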
Answer 1 (score: 0)
I got to the bottom of this.
The problem turns out to lie with nvidia-smi rather than with tensorflow: if you enable persistence mode on the GPUs via sudo nvidia-smi -pm 1, nvidia-smi reports the correct state, e.g. something like:
Fri May 19 15:28:06 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48 Driver Version: 367.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:04:00.0 Off | 0 |
| N/A 60C P0 143W / 149W | 6263MiB / 11439MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 0000:05:00.0 Off | 0 |
| N/A 46C P0 136W / 149W | 8311MiB / 11439MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 On | 0000:08:00.0 Off | 0 |
| N/A 64C P0 110W / 149W | 8311MiB / 11439MiB | 67% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 On | 0000:09:00.0 Off | 0 |
| N/A 48C P0 142W / 149W | 8311MiB / 11439MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 On | 0000:84:00.0 Off | 0 |
| N/A 32C P8 27W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 On | 0000:85:00.0 Off | 0 |
| N/A 26C P8 28W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 On | 0000:88:00.0 Off | 0 |
| N/A 28C P8 26W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 On | 0000:89:00.0 Off | 0 |
| N/A 25C P8 28W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 42840 C ...sadl/anaconda3/envs/tensorflow/bin/python 6259MiB |
| 1 42878 C ...sadl/anaconda3/envs/tensorflow/bin/python 8307MiB |
| 2 43264 C ...sadl/anaconda3/envs/tensorflow/bin/python 8307MiB |
| 3 4721 C python 8307MiB |
+-----------------------------------------------------------------------------+
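If you want to verify that persistence mode actually took effect on every GPU, a small check along these lines works (my addition; the persistence_mode query field is assumed to be available on your driver version, see nvidia-smi --help-query-gpu):

import subprocess

# After `sudo nvidia-smi -pm 1`, every GPU should report "Enabled".
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,persistence_mode", "--format=csv,noheader"]
).decode()
for line in out.strip().splitlines():
    print(line.strip())  # e.g. "0, Enabled"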
Thanks for the input that helped track this down.