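I am using Caffe, a framework for convolutional neural networks that runs on GPU (or CPU). It mainly uses CUDA 6.0. I am training a CNN on a large image dataset (ImageNet, 1.2 million images), which requires a lot of memory. For now I am running small experiments on a subset of the original dataset (which also requires a lot of memory). I am working on a GPU cluster. This is the output of the command $ nvidia-smi:
+------------------------------------------------------+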
| NVIDIA-SMI 331.62     Driver Version: 331.62         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2050         Off  | 0000:08:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |   1585MiB /  2687MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2050         Off  | 0000:09:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M2050         Off  | 0000:0A:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M2050         Off  | 0000:15:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla M2050         Off  | 0000:16:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla M2050         Off  | 0000:19:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla M2050         Off  | 0000:1A:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla M2050         Off  | 0000:1B:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     10242  ../../../build/tools/train_net.bin                  1577MiB |
+-----------------------------------------------------------------------------+
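However, when I try to run several of these processes at once (for example, the same train_net.bin on different datasets), they fail because they all run on the same GPU (GPU 0 in the output above). How can I force a process to use a different GPU? Any help would be appreciated.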
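
For reference, one general CUDA mechanism that may apply here is the CUDA_VISIBLE_DEVICES environment variable, which limits the GPUs a process can see. The sketch below is only an illustration of that idea, not something taken from Caffe itself: the helper names env_for_gpu and launch_on_gpu are hypothetical, and the child command is a placeholder that just prints the variable rather than a real train_net.bin invocation. Note that inside each child process the single visible GPU is typically renumbered as device 0.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA.

    env_for_gpu is a hypothetical helper; CUDA_VISIBLE_DEVICES itself is a
    standard CUDA environment variable.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_on_gpu(command, gpu_index):
    """Start `command` as a child process restricted to the given GPU index."""
    return subprocess.Popen(command, env=env_for_gpu(gpu_index))


if __name__ == "__main__":
    # Placeholder child command: it only echoes the variable it received.
    # In practice this would be the training command with its own arguments.
    child = [sys.executable, "-c",
             "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]

    # One process per idle GPU (indices 1 and 2 in the nvidia-smi output above).
    processes = [launch_on_gpu(child, gpu) for gpu in (1, 2)]
    for process in processes:
        process.wait()
```

The same effect can usually be had directly from the shell by prefixing the command with the variable assignment, so that each training run is started with a different GPU index.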