Question

I have a test application that performs matrix multiplication, and I am trying to offload it to the GPU using nvblas.

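#include <armadillo>
#include <cstdlib>   // atoi
#include <iostream>
using namespace arma;
using namespace std;
int main(int argc, char *argv[]) {
    if (argc < 5) {
        cerr << "usage: " << argv[0] << " m k n t" << endl;
        return 1;
    }
    int m = atoi(argv[1]);  // rows of A and C
    int k = atoi(argv[2]);  // columns of A, rows of B
    int n = atoi(argv[3]);  // columns of B and C
    int t = atoi(argv[4]);  // number of timed repetitions
    cout << "m::" << m << "::k::" << k << "::n::" << n << endl;
    mat A = randu<mat>(m, k);
    mat B = randu<mat>(k, n);
    mat C;
    C.zeros(m, n);
    cout << "norm c::" << arma::norm(C, "fro") << endl;
    // Armadillo's timer is the wall_clock class; there are no free tic()/toc() functions.
    wall_clock timer;
    timer.tic();
    for (int i = 0; i < t; i++) {
        C = A * B;  // dispatched to the BLAS dgemm_ routine
    }
    cout << "time taken ::" << timer.toc() / t << endl;
    cout << "norm c::" << arma::norm(C, "fro") << endl;
    return 0;
}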

I compiled the code as follows.

CPU

g++ testmm.cpp -I$ARMADILLO_INCLUDE_DIR -lopenblas -L$OPENBLAS_ROOT/lib/ -std=c++11 -o a.cpu.out

GPU

g++ testmm.cpp -I$ARMADILLO_INCLUDE_DIR -lopenblas -L$OPENBLAS_ROOT/lib/ -std=c++11 -lnvblas -L$CUDATOOLKIT_HOME/lib64 -o a.cuda.out

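When I run a.cpu.out and a.cuda.out with 4096 4096 4096, both take about the same time, roughly 11 seconds. I see no reduction in time with a.cuda.out. In nvblas.conf I left everything at its default except for (a) changing the path to openblas and (b) enabling auto-pinned memory. nvblas.log says "device 0" and contains no other output. nvidia-smi shows no increase in GPU activity, and nvprof shows a bunch of cudaMalloc and cudaMemcpy calls, device capability queries and so on, but no gemm calls at all.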

ldd on a.cuda.out shows that it is linked against the nvblas, cublas, cudart and CPU openblas libraries. Am I making a mistake somewhere?

Answer 1

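The link order was the problem. The linker binds dgemm_ to the first library on the command line that defines it, so -lnvblas has to come before -lopenblas. The issue was resolved when I built the GPU version as follows.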

GPU

g++ testmm.cpp -lnvblas -L$CUDATOOLKIT_HOME/lib64 -I$ARMADILLO_INCLUDE_DIR -lopenblas -L$OPENBLAS_ROOT/lib/ -std=c++11 -o a.cuda.out

With the build above, dumping the symbol table gives the following output.

nm a.cuda.out | grep -is dgemm
             U cblas_dgemm
             U dgemm_@@libnvblas.so.9.1 <-- this shows correct linking and ability to offload to gpu.

If it is not linked correctly, the symbol table looks like this instead.

nm a.cuda.out | grep -is dgemm
             U cblas_dgemm
             U dgemm_  <-- no libnvblas version tag here, which indicates the problem.

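Note that ldd lists nvblas, cublas, cudart and openblas in both cases, so it cannot tell the two builds apart. In the incorrectly linked build, the dgemm that actually runs is always the openblas one.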
