Question

I'm facing a strange problem with GPU allocation in Torque.

I'm running Torque 6.1.0 on a machine with two NVIDIA GTX Titan X GPUs, and I'm using pbs_sched for scheduling. The output of nvidia-smi with the machine idle is as follows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:03:00.0      On |                  N/A |
| 22%   40C    P8    15W / 250W |      0MiB / 12204MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:04:00.0     Off |                  N/A |
| 22%   33C    P8    14W / 250W |      0MiB / 12207MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I have a simple test script to check GPU allocation:

#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1:gpus=1:reseterr:exclusive_process

echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery

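deviceQuery is a utility that ships with CUDA. When I run it from the command line, it correctly finds both GPUs. When I restrict it to a single device from the command line...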

CUDA_VISIBLE_DEVICES=0 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
#or
CUDA_VISIBLE_DEVICES=1 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery

...it also correctly finds one GPU or the other.
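The two manual checks above can also be run in a loop. This is only a minimal sketch: it assumes deviceQuery lives at the path used in the test script, and the `check_device` helper and the `DEVICE_QUERY` variable are hypothetical names introduced here for illustration.

```shell
#!/bin/bash
# Assumption: deviceQuery is at the path used in the question; override DEVICE_QUERY if not.
DEVICE_QUERY="${DEVICE_QUERY:-$HOME/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery}"

# check_device (hypothetical helper): run deviceQuery with only one device index
# visible and print a one-line summary built from its "Result = ..." text.
check_device() {
    idx="$1"
    # deviceQuery reports "Result = PASS" or "Result = FAIL" at the end of its output.
    result=$(CUDA_VISIBLE_DEVICES="$idx" "$DEVICE_QUERY" 2>&1 | grep -o 'Result = [A-Z]*' | tail -n 1)
    echo "CUDA_VISIBLE_DEVICES=$idx: ${result:-no result line (is DEVICE_QUERY correct?)}"
}

# Two GPUs in this machine, so indices 0 and 1.
for idx in 0 1; do
    check_device "$idx"
done
```

Run from an interactive shell, this should report a result for each index, which gives a baseline to compare against what a queued job sees.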

When I submit test.sh to the queue with qsub and no other jobs are running, it again works correctly. Here is the output:

CUDA_VISIBLE_DEVICES: 0 
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN X"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 12204 MBytes (12796887040 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1076 MHz (1.08 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX TITAN X Result = PASS

However, if another job is already running on gpu0 (i.e., if the new job is assigned CUDA_VISIBLE_DEVICES=1), the job cannot find any GPU. Output:

CUDA_VISIBLE_DEVICES: 1
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL

Does anyone know what is going on here?

Answer 1

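I think I've solved my own problem, but unfortunately I tried two things at once, and I don't want to go back and confirm which one fixed it. It was one of the following: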

Removing the --enable-cgroups option from Torque's configure script before building.
Running the following steps during the Torque installation:

make packages

sh torque-package-server-linux-x86_64.sh --install

sh torque-package-mom-linux-x86_64.sh --install

sh torque-package-clients-linux-x86_64.sh --install
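For a single-node setup, the second option could be scripted roughly as follows. This is a sketch under stated assumptions, not the documented procedure: it expects to be run from the Torque build directory after configure and make have completed, and the `run` helper, the `install_torque_packages` function, and the `DRY_RUN` variable are hypothetical conveniences added here. By default it only prints the commands.

```shell
#!/bin/bash
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# run (hypothetical helper): print a command, and execute it unless in dry-run mode.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@" || { echo "command failed: $*" >&2; exit 1; }
    fi
}

# install_torque_packages (hypothetical name): the package-based steps from the answer.
install_torque_packages() {
    run make packages
    # Single node: server, MOM, and clients are all installed on the same machine.
    for role in server mom clients; do
        run sh "torque-package-${role}-linux-x86_64.sh" --install
    done
}

install_torque_packages
```

Once the printed commands look right, running it with DRY_RUN=0 (and sufficient privileges) executes the same four steps and stops at the first failure.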

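Regarding the second option, I know these steps are properly documented in the Torque installation instructions. However, I have a simple setup with only a single node (the compute node and the server are the same machine). I assumed 'make install' would do everything that the package installs do for that single node, but maybe I was wrong.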
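To tell the two changes apart later, a job could log what it can actually see of the GPUs. The script below is a minimal sketch and not part of the original fix: `report_gpu_view` is a hypothetical helper, and the idea that a cgroup-enabled build changes which devices the job sees is an assumption that this post does not confirm.

```shell
#!/bin/bash
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1:gpus=1:reseterr:exclusive_process

# report_gpu_view (hypothetical helper): print what this job can see of the GPUs.
report_gpu_view() {
    echo "CUDA_VISIBLE_DEVICES: ${CUDA_VISIBLE_DEVICES:-<unset>}"

    echo "--- NVIDIA device nodes ---"
    ls -l /dev/nvidia* 2>/dev/null || echo "no /dev/nvidia* nodes visible"

    echo "--- GPUs listed by the driver ---"
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L
    else
        echo "nvidia-smi not found in PATH"
    fi

    # Assumption: cgroup confinement, if any, shows up in the job's cgroup membership.
    echo "--- cgroup membership of this job ---"
    cat /proc/self/cgroup 2>/dev/null || echo "/proc/self/cgroup not readable"
}

report_gpu_view
```

Submitting this once while gpu0 is free and once while it is busy, and comparing the two reports with the output of the same commands in an interactive shell, should show whether the job's view of the devices differs when it is assigned CUDA_VISIBLE_DEVICES=1.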

Torque jobs cannot find GPUs when CUDA_VISIBLE_DEVICES is not equal to 0
