TensorFlow execution and memory on different GPUs

Date: 2017-05-17 15:57:34

Tags: python tensorflow

Sometimes when I run TensorFlow on a single GPU within a multi-GPU machine, the code executes on one GPU but allocates its memory on another. For obvious reasons, this causes a dramatic slowdown.

For example, see the nvidia-smi output below. Here, my colleague is using GPUs 0 and 1 (processes 32918 and 33112), while I launch my TensorFlow processes with the following (before importing tensorflow):

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)

where gpu_id is 2, 3, and 4 respectively for my three processes. As the output shows, memory is allocated correctly on GPUs 2, 3, and 4, but the code executes elsewhere! In this case, on GPUs 0, 1, and 7.

Wed May 17 17:04:01 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:04:00.0     Off |                    0 |
| N/A   41C    P0    75W / 149W |    278MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
| N/A   36C    P0    89W / 149W |    278MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:08:00.0     Off |                    0 |
| N/A   61C    P0    58W / 149W |   6265MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:09:00.0     Off |                    0 |
| N/A   42C    P0    70W / 149W |   8313MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   51C    P0    55W / 149W |   8311MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:85:00.0     Off |                    0 |
| N/A   29C    P0    68W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 0000:88:00.0     Off |                    0 |
| N/A   31C    P0    54W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 0000:89:00.0     Off |                    0 |
| N/A   27C    P0    68W / 149W |      0MiB / 11439MiB |     33%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     32918    C   python                                         274MiB |
|    1     33112    C   python                                         274MiB |
|    2     34891    C   ...sadl/anaconda3/envs/tensorflow/bin/python  6259MiB |
|    3     34989    C   ...sadl/anaconda3/envs/tensorflow/bin/python  8309MiB |
|    4     35075    C   ...sadl/anaconda3/envs/tensorflow/bin/python  8307MiB |
+-----------------------------------------------------------------------------+

It appears that TensorFlow is, for some reason, partially ignoring the CUDA_VISIBLE_DEVICES setting.

I am not using any device-placement commands in my code.
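For reference, the GPU-selection logic above can be wrapped in a small helper; the function name is mine, and the key point is that it must run before importing tensorflow, because CUDA reads these variables only once, at initialisation:

```python
import os

def select_gpu(gpu_id):
    """Restrict this process to one physical GPU, numbered as in nvidia-smi.

    Must be called before 'import tensorflow': CUDA reads these environment
    variables when it is first initialised, so setting them later has no effect.
    """
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"    # enumerate devices like nvidia-smi
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # expose only this physical GPU
    return os.environ["CUDA_VISIBLE_DEVICES"]

select_gpu(2)  # the process now sees only physical GPU 2 (as "/gpu:0" inside TF)
```

Inside TensorFlow, the single visible device then appears as device 0, regardless of its physical index.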

This is with TensorFlow 1.1 on Ubuntu 16.04, and it has happened to me across a range of different scenarios.

Is there a known scenario in which this can happen? If so, is there anything I can do about it?

2 answers:

Answer 0 (score: 1):

One possible cause is nvidia-smi itself.

nvidia-smi's ordering does not match the CUDA GPU IDs. From the CUDA documentation:

"Users wishing for consistency are recommended to use UUID or PCI bus ID, since device enumeration ordering is not guaranteed to be consistent."

"FASTEST_FIRST causes CUDA to guess which device is fastest using a simple heuristic, and make that device 0, leaving the order of the rest of the devices unspecified. PCI_BUS_ID orders devices by PCI bus ID in ascending order."

See here: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars

It is also discussed here: Inconsistency of IDs between 'nvidia-smi -L' and cuDeviceGetName()
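The quoted behaviour can be illustrated with a short sketch. The bus IDs below are copied from the nvidia-smi table in the question; with CUDA_DEVICE_ORDER=PCI_BUS_ID, CUDA enumerates devices by ascending PCI bus ID, which is exactly nvidia-smi's ordering:

```python
# PCI bus IDs for GPUs 0-7, as reported by nvidia-smi in the question.
pci_bus_ids = [
    "0000:04:00.0", "0000:05:00.0", "0000:08:00.0", "0000:09:00.0",
    "0000:84:00.0", "0000:85:00.0", "0000:88:00.0", "0000:89:00.0",
]

# Under CUDA_DEVICE_ORDER=PCI_BUS_ID, CUDA device k is the k-th device in
# ascending bus-ID order -- the same order nvidia-smi uses.
cuda_order = sorted(pci_bus_ids)
assert cuda_order == pci_bus_ids  # CUDA IDs now agree with nvidia-smi IDs
```

Under the default FASTEST_FIRST ordering, by contrast, device 0 is whichever GPU a heuristic judges fastest, so CUDA ID k and nvidia-smi ID k may refer to different physical cards.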

Answer 1 (score: 0):

I got to the bottom of this.

The problem seems to lie with nvidia-smi rather than with TensorFlow: if you enable persistence mode on the GPUs via sudo nvidia-smi -pm 1, the correct status is displayed, for example something like this:

Fri May 19 15:28:06 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:04:00.0     Off |                    0 |
| N/A   60C    P0   143W / 149W |   6263MiB / 11439MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:05:00.0     Off |                    0 |
| N/A   46C    P0   136W / 149W |   8311MiB / 11439MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 0000:08:00.0     Off |                    0 |
| N/A   64C    P0   110W / 149W |   8311MiB / 11439MiB |     67%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 0000:09:00.0     Off |                    0 |
| N/A   48C    P0   142W / 149W |   8311MiB / 11439MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 0000:84:00.0     Off |                    0 |
| N/A   32C    P8    27W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 0000:85:00.0     Off |                    0 |
| N/A   26C    P8    28W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 0000:88:00.0     Off |                    0 |
| N/A   28C    P8    26W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 0000:89:00.0     Off |                    0 |
| N/A   25C    P8    28W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     42840    C   ...sadl/anaconda3/envs/tensorflow/bin/python  6259MiB |
|    1     42878    C   ...sadl/anaconda3/envs/tensorflow/bin/python  8307MiB |
|    2     43264    C   ...sadl/anaconda3/envs/tensorflow/bin/python  8307MiB |
|    3      4721    C   python                                        8307MiB |
+-----------------------------------------------------------------------------+

Thanks for the input that helped resolve this issue.