CUDA MPS server fails to start on a workstation with multiple GPUs

Time: 2016-03-15 15:06:13

Tags: cuda mpi nvidia

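Edit: I tried selecting the valid GPUs by their UUIDs rather than by their numeric IDs, and that makes things work.

It seems MPS was still seeing the GT 610 even though I thought it shouldn't. That is why it wasn't working.

I'm having trouble with CUDA MPS on one of my machines.

The machine has 4 Tesla K80s (dual-GPU boards, so they show up as 8 devices below), plus a GT 610 (edit: which does not support MPS).

Here is the nvidia-smi output: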

riveale@coiworkstation1:~/code/psweep2/src$ nvidia-smi
Tue Mar 15 23:51:59 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.63     Driver Version: 352.63         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 610      Off  | 0000:01:00.0     N/A |                  N/A |
| 40%   29C    P8    N/A /  N/A |      3MiB /  1021MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:04:00.0     Off |                    0 |
| N/A   29C    P8    26W / 149W |     55MiB / 11519MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
| N/A   24C    P8    30W / 149W |     55MiB / 11519MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:08:00.0     Off |                    0 |
| N/A   34C    P8    27W / 149W |     55MiB / 11519MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:09:00.0     Off |                    0 |
| N/A   28C    P8    29W / 149W |     55MiB / 11519MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   31C    P8    28W / 149W |     55MiB / 11519MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 0000:85:00.0     Off |                    0 |
| N/A   26C    P8    30W / 149W |     55MiB / 11519MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 0000:88:00.0     Off |                    0 |
| N/A   31C    P8    26W / 149W |     55MiB / 11519MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   8  Tesla K80           Off  | 0000:89:00.0     Off |                    0 |
| N/A   25C    P8    31W / 149W |     55MiB / 11519MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
+-----------------------------------------------------------------------------+

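As you can see, I have set the K80s to exclusive-process compute mode.

I can run a sanity check using only the first GPU, starting the MPS server and so on, like this: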

export CUDA_VISIBLE_DEVICES="0"
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d

Then I run my script:

NRANKS=4
mpirun -n $NRANKS gputest.exe

This runs successfully, and I see the following in /tmp/nvidia-log/server.log:

riveale@coiworkstation1:~/code/psweep2/src$ cat /tmp/nvidia-log/server.log 
[2016-03-15 23:57:07.883 Other  6957] Start
[2016-03-15 23:57:08.513 Other  6957] New client 6956 connected
[2016-03-15 23:57:08.513 Other  6957] New client 6954 connected
[2016-03-15 23:57:08.514 Other  6957] New client 6955 connected

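However, when I try to use more than one GPU on the system, I run into problems. Specifically, with the following (exactly the same as before, except that I now have 2 visible CUDA devices):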

export CUDA_VISIBLE_DEVICES="0,1"
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d

(ps ax | grep mps shows that the daemon started fine, no different from the working example above.) Followed by:

NRANKS=7
mpirun -n $NRANKS gputest.exe

I get:

riveale@coiworkstation1:~/code/psweep2/src$ cat /tmp/nvidia-log/server.log 
[2016-03-15 23:59:55.718 Other  7102] Start
[2016-03-15 23:59:56.301 Other  7102] MPS server failed to start
[2016-03-15 23:59:56.301 Other  7102] MPS is only supported on 64-bit Linux platforms, with an SM 3.5 or higher GPU.
[2016-03-15 23:59:56.727 Other  7105] Start
[2016-03-15 23:59:57.302 Other  7105] MPS server failed to start
[2016-03-15 23:59:57.302 Other  7105] MPS is only supported on 64-bit Linux platforms, with an SM 3.5 or higher GPU.
[2016-03-15 23:59:57.718 Other  7107] Start
[2016-03-15 23:59:58.291 Other  7107] MPS server failed to start
[2016-03-15 23:59:58.291 Other  7107] MPS is only supported on 64-bit Linux platforms, with an SM 3.5 or higher GPU.
[2016-03-15 23:59:58.709 Other  7109] Start
[2016-03-15 23:59:59.236 Other  7109] MPS server failed to start
[2016-03-15 23:59:59.236 Other  7109] MPS is only supported on 64-bit Linux platforms, with an SM 3.5 or higher GPU.
[2016-03-15 23:59:59.644 Other  7111] Start
[2016-03-16 00:00:00.215 Other  7111] MPS server failed to start
[2016-03-16 00:00:00.215 Other  7111] MPS is only supported on 64-bit Linux platforms, with an SM 3.5 or higher GPU.
[2016-03-16 00:00:00.651 Other  7113] Start
[2016-03-16 00:00:01.221 Other  7113] MPS server failed to start
[2016-03-16 00:00:01.221 Other  7113] MPS is only supported on 64-bit Linux platforms, with an SM 3.5 or higher GPU.

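Weird.

Thanks in advance for any help or ideas.

Another odd thing: the exact same procedure works on my other workstation, which has the same setup except that it has a Quadro K620 instead of the GT 610. The K620 is a CUDA device, so I suspect the GT 610 is the problem. I'm working remotely right now, so I can't swap out the card to see whether that changes anything.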

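1 Answer:

Answer 0: (score: 1)

As noted in the edit, the solution is to take the UUIDs of the GPUs with compute capability >= 3.5 and set CUDA_VISIBLE_DEVICES to those values. It seems that, even though CUDA device 0 is correctly one of the K80s, the display device (the GT 610) gets listed as device #1 for some reason, rather than as the last device as I had expected.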
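
A minimal sketch of the UUID approach, assuming a driver whose nvidia-smi supports the --query-gpu option: it lists each GPU's UUID and name, keeps the GPUs whose name matches a pattern, and joins their UUIDs into a CUDA_VISIBLE_DEVICES value. The helper name select_gpu_uuids and the 'K80' pattern are illustrative choices, not something taken from the original post.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads "uuid, name" CSV lines on stdin and prints a
# comma-separated list of the UUIDs whose GPU name matches the given pattern.
select_gpu_uuids() {
    local pattern="$1"
    awk -F', ' -v pat="$pattern" '$2 ~ pat { printf "%s%s", sep, $1; sep = "," }'
}

# Only runs where nvidia-smi is installed. Do this before starting the
# MPS control daemon so the daemon inherits the restricted device list.
if command -v nvidia-smi >/dev/null 2>&1; then
    CUDA_VISIBLE_DEVICES="$(nvidia-smi --query-gpu=uuid,name --format=csv,noheader \
        | select_gpu_uuids 'K80')"
    export CUDA_VISIBLE_DEVICES
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
fi
```

Because a UUID identifies a physical GPU, this selection does not depend on how the CUDA runtime happens to number the devices. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID might be another way to make CUDA's numbering follow nvidia-smi's, but that was not tried here.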

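I had to set CUDA_VISIBLE_DEVICES this way on every node/machine before launching nvidia-cuda-mps-control -d as above.

It turned out that MPS is slow (the MPS server uses a lot of CPU), so I decided not to use it.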