Question

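I am running Octave with nvblas, following the instructions here. I have CUDA toolkit 7.5 installed and a Tesla K40c GPU. To start Octave with nvblas, I used LD_PRELOAD=libnvblas.so octave. I then ran the following simple code: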

N = 256
A = rand(N,N)
B = rand(N,N)
A*B

This produces a matrix with reasonable values. However, if I increase N to 512, or to any number above 512, I get all zeros (or very small numbers) back.

This does not happen if I use OpenBLAS. The matrices should be small enough to fit in the card's RAM (12 GB). Any idea why this might be happening?

Note: this does not happen if I make A and B identity matrices, but it still happens with A = B = ones(N,N).
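The all-ones case makes a convenient self-check, because every entry of the product of two N-by-N all-ones matrices must equal exactly N. A minimal sketch, assuming the same Octave session started with LD_PRELOAD=libnvblas.so; the list of sizes is only illustrative:

```octave
# Every entry of ones(N,N) * ones(N,N) must equal N exactly,
# so any other value shows that the multiplication went wrong.
for N = [256 512 1024]
  C = ones(N, N) * ones(N, N);
  bad = sum(C(:) != N);          # number of entries that differ from N
  printf("N = %4d: %d of %d entries wrong\n", N, bad, N * N);
end
```

With a working BLAS the count of wrong entries is 0 for every size.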

Answer 1

Sorry that this question is somewhat old, but I tried this on an Amazon AWS EC2 p2.xlarge instance with a K80 GPU, and it seems to work.

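I got similar results (lots of zeros) when I had the default "NVBLAS_GPU_LIST 0 1" set in nvblas.conf. That setting seems to refer to two GPUs, so I changed it to a single one and it worked. The full file is below: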

#Put here the CPU BLAS fallback Library of your choice
NVBLAS_CPU_BLAS_LIB libopenblas.so

# Specify which output log file (default is stderr)
NVBLAS_LOGFILE nvblas.log

# List of GPU devices Id to participate to the computation
# By default if no GPU are listed, only device 0 will be used
NVBLAS_GPU_LIST 0
NVBLAS_AUTOPIN_MEM_ENABLED
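Since the GPU list turned out to matter, it can help to confirm which configuration file a session would pick up. The sketch below assumes the lookup described in NVIDIA's NVBLAS documentation (the file named by the NVBLAS_CONFIG_FILE environment variable, otherwise nvblas.conf in the current directory). The helper nvblas_gpu_list_line is a hypothetical name introduced here, not part of NVBLAS or Octave:

```octave
1;  # script marker so the function below can be defined in the same file

# Hypothetical helper (not part of NVBLAS): return the active
# NVBLAS_GPU_LIST line of a config file, or "" if file or line is missing.
function found = nvblas_gpu_list_line(cfg)
  found = "";
  if exist(cfg, "file") != 2
    return;
  end
  rows = strsplit(fileread(cfg), "\n");
  for k = 1:numel(rows)
    row = strtrim(rows{k});
    # Commented-out lines start with "#" and therefore do not match.
    if strncmp(row, "NVBLAS_GPU_LIST", 15)
      found = row;
      return;
    end
  end
end

# Assumed lookup order: NVBLAS_CONFIG_FILE, else ./nvblas.conf
cfg = getenv("NVBLAS_CONFIG_FILE");
if isempty(cfg)
  cfg = fullfile(pwd(), "nvblas.conf");
end
printf("config file: %s\n", cfg);
printf("GPU list   : %s\n", nvblas_gpu_list_line(cfg));
```

An empty GPU list line means either that the file was not found or that it contains no NVBLAS_GPU_LIST entry.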

The program (t1.m) is slightly modified from the one at the NVIDIA link so that it counts the non-zero elements of the output matrix:

N = 16384;

# from the original NVidia example:
#A = single(rand(N,N));
#B = single(rand(N,N));

# double precision seems to work fine (not checked in detail)
A = rand(N,N);
B = rand(N,N);

start = clock();
C = A * B;
elapsedTime = etime(clock(), start);
disp(elapsedTime);
gFlops = 2*N*N*N/(elapsedTime * 1e+9);
disp(gFlops);

disp("number of elements >0:")
disp(sum(sum(C > 0)));

disp("Should be:")
disp(N*N)
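Counting positive entries catches an all-zero result, but not values that are non-zero and still wrong. A stricter check, sketched here as an illustration only, compares a few sampled entries of C against a reference built from element-wise operations. It relies on NVBLAS intercepting only Level-3 BLAS calls such as GEMM, as NVIDIA's documentation describes, so the reference should not go through the GPU path. The matrix size, sample count and variable names are arbitrary choices:

```octave
N = 1024;                # smaller than in t1.m so the check runs quickly
A = rand(N, N);
B = rand(N, N);
C = A * B;               # matrix product, dispatched to GEMM

k = 20;                  # number of entries to spot-check (arbitrary)
rows = randi(N, k, 1);
cols = randi(N, k, 1);
max_rel_err = 0;
for t = 1:k
  r = rows(t);
  c = cols(t);
  # Element-wise product and sum: a dot product computed without GEMM.
  ref = sum(A(r, :) .* B(:, c).');
  max_rel_err = max(max_rel_err, abs(C(r, c) - ref) / abs(ref));
end
printf("max relative error over %d sampled entries: %g\n", k, max_rel_err);
```

For double precision the reported error should be close to machine precision. An error near 1 would point to the zero-filled output described in the question.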

FYI, this is the nvidia-smi output while it was running as above (memory usage peaked at 172MiB with N = 16384):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   44C    P0    80W / 149W |     80MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     21080    C   /usr/bin/octave-cli                             78MiB |
+-----------------------------------------------------------------------------+

These are the nvidia & cuda files I had installed beforehand:

cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb  
libcudnn5-dev_5.1.10-1+cuda8.0_amd64.deb
libcudnn5_5.1.10-1+cuda8.0_amd64.deb                   
nvidia-driver-local-repo-ubuntu1604_375.51-1_amd64.deb

I seem to get a speedup of about 8.6, from about 55 gflops with plain Octave to about 47 gfl with the GPU version.

NVBLAS fails silently for moderately large matrix multiplications
