How do I force the use of a different GPU in a cluster?

Asked: 2014-07-01 22:22:24

Tags: cuda neural-network cluster-computing gpu multi-gpu

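I am using Caffe, a framework for convolutional neural networks that runs on GPUs (or CPUs). It mainly uses CUDA 6.0. I am training a CNN on a large image dataset (ImageNet, 1.2 million images), which requires a lot of memory. For now, though, I am running small experiments on a subset of the original data (which also requires a lot of memory). I am also working on a GPU cluster. This is the output of the command $ nvidia-smi:
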
+------------------------------------------------------+                       
| NVIDIA-SMI 331.62     Driver Version: 331.62         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2050         Off  | 0000:08:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |   1585MiB /  2687MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2050         Off  | 0000:09:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M2050         Off  | 0000:0A:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M2050         Off  | 0000:15:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla M2050         Off  | 0000:16:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla M2050         Off  | 0000:19:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla M2050         Off  | 0000:1A:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla M2050         Off  | 0000:1B:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     10242  ../../../build/tools/train_net.bin                  1577MiB |
+-----------------------------------------------------------------------------+

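But when I try to run several of these processes at once (for example, the same train_net.bin on different datasets), they fail because they all run on the same GPU. I would like to know how to force a process to use another GPU. I would appreciate any help.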
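
To make the goal concrete, here is a minimal sketch of one common approach: the CUDA runtime reads the CUDA_VISIBLE_DEVICES environment variable, so a launcher can start each process with a different value and each process then sees only that one GPU (as its device 0). The helper names gpu_env and launch_on_gpu are made up for illustration, and this has not been tried on the cluster above. Note also that CUDA's device numbering is not guaranteed to match the order nvidia-smi prints, so the index may need checking.

```python
import os
import subprocess


def gpu_env(gpu_id, base_env=None):
    """Return a copy of the environment that exposes a single GPU.

    gpu_id is the CUDA device index to expose (assumed here to line up
    with the nvidia-smi listing, which is not guaranteed).
    """
    env = dict(os.environ if base_env is None else base_env)
    # The CUDA runtime only enumerates the devices listed in this variable.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def launch_on_gpu(cmd, gpu_id):
    """Start cmd (a list of strings) as a child process pinned to one GPU."""
    return subprocess.Popen(cmd, env=gpu_env(gpu_id))
```

Under these assumptions, calling launch_on_gpu(["../../../build/tools/train_net.bin", "solver_a.prototxt"], 1) and then the same with a second solver file and gpu_id 2 would start two trainings on two different cards; the solver file names are placeholders. The same effect is available from a shell by prefixing the command with CUDA_VISIBLE_DEVICES=1.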

0 answers:

No answers yet