Accessing submatrices using cuBLAS

Time: 2013-02-07 04:17:50

Tags: matrix cuda fortran partitioning cublas

I have read the following post:

Accessing submatrices using LAPACK

I would like to do something similar, calling cuBLAS routines from Fortran.

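Basically, I have a large matrix partitioned into 3 x 3 blocks, and the partitioning changes at every step of a loop. At the moment, I allocate and free pointers for each sub-block and copy the relevant parts of the matrix to and from the device at every step. This creates a lot of overhead that I would like to eliminate. Is that possible?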

2 answers:

Answer 0: (score: 4)

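You can do device pointer arithmetic in host code in the same way you would with host pointers. For example, if you had an MxN matrix stored on the GPU: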

    float *A_d;
    // Cast before multiplying so that M*N cannot overflow int for large matrices.
    cudaMalloc((void **)&A_d, size_t(M) * size_t(N) * sizeof(float));

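and you wanted to operate on a submatrix starting at (x1,y1), then you would pass A_d+x1+M*y1 to any CUBLAS function that expects a matrix as an argument. CUBLAS uses column-major storage, so x1 is the row offset, y1 is the column offset, and the leading dimension passed along with the pointer stays M.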
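The offset computation can be sketched on the host as follows. This is a minimal illustration, not part of the original answer: `colMajorOffset` and `submatrixPtr` are hypothetical helper names, and the only assumption is column-major storage with leading dimension `ld`. The same arithmetic applies to a device pointer, since it only computes an address and never dereferences it.

```cpp
#include <cassert>
#include <cstddef>

// Column-major offset of element (row, col). ld is the leading dimension,
// i.e. the number of rows of the full allocated matrix.
inline std::size_t colMajorOffset(std::size_t row, std::size_t col, std::size_t ld)
{
    assert(row < ld);  // a valid row index is always below the leading dimension
    return row + ld * col;
}

// Pointer to the submatrix whose top-left element is (x1, y1).
// Hypothetical helper: the returned pointer must be used with the
// parent's leading dimension ld, not with the submatrix row count.
template <typename T>
T *submatrixPtr(T *A, std::size_t x1, std::size_t y1, std::size_t ld)
{
    return A + colMajorOffset(x1, y1, ld);
}
```

With these helpers, `submatrixPtr(A_d, x1, y1, M)` yields the same address as `A_d+x1+M*y1`, and element (i,j) of the submatrix is found at index `i + M*j` relative to that pointer.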

Answer 1: (score: 3)

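talonmies has already answered the question satisfactorily. To support his answer, and in case it is useful to other users, I am providing a full example showing how to use cublas<t>gemm to multiply submatrices of two full matrices A and B and how to assign the result to a submatrix of a full matrix C.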

Although the question concerns Fortran, the code below is given in C/C++, since I do not use Fortran with CUDA and since many users use CUDA with C/C++.

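The code makes use of

  1. pointer arithmetic to access submatrices;
  2. the concepts of leading dimension and of submatrix dimensions.

The code below considers three matrices:

  1. A - 10 x 9;
  2. B - 15 x 13;
  3. C - 10 x 12.

Matrix C is initialized to all 10s. The code performs the following submatrix multiplication, written in Matlab notation: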

      C(1+x3:5+x3,1+y3:3+y3) = A(1+x1:5+x1,1+y1:4+y1) * B(1+x2:4+x2,1+y2:3+y2);
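To make the roles of the offsets and leading dimensions concrete, here is a host-side reference for the same operation. It is only a sketch for checking results on the CPU: `gemmRef` is a hypothetical helper, not a cuBLAS routine, and it assumes column-major storage with no transposition. The 1-based Matlab ranges above become 0-based pointer offsets (x, y) plus block sizes.

```cpp
// Reference C = alpha * A * B + beta * C on column-major data.
// A is an m x k block, B a k x n block, C an m x n block. lda, ldb and ldc
// are the leading dimensions of the FULL matrices the blocks live in.
void gemmRef(int m, int n, int k, float alpha,
             const float *A, int lda,
             const float *B, int ldb,
             float beta, float *C, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.f;
            for (int p = 0; p < k; ++p)
                acc += A[i + lda * p] * B[p + ldb * j];
            // When beta is zero, the old value of C is not read at all.
            C[i + ldc * j] = (beta == 0.f) ? alpha * acc
                                           : alpha * acc + beta * C[i + ldc * j];
        }
    }
}
```

For the sizes used in this answer, the call would be `gemmRef(5, 3, 4, 1.f, A + x1 + 10 * y1, 10, B + x2 + 15 * y2, 15, 0.f, C + x3 + 10 * y3, 10)`, which touches only the 5 x 3 block of C and leaves every other element at its initial value.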
      

The Utilities.cu and Utilities.cuh files are maintained separately and are omitted here.

      #include <cstdio>       // printf
      #include <iostream>     // std::cout
      
      #include <thrust/device_vector.h>
      #include <thrust/random.h>
      
      #include <cublas_v2.h>
      
      #include "Utilities.cuh"
      
      /********/
      /* MAIN */
      /********/
      int main()
      {
          /**************************/
          /* SETTING UP THE PROBLEM */
          /**************************/
      
          //const int Nrows1 = 10;            // --- Number of rows of matrix 1
          //const int Ncols1 = 10;            // --- Number of columns of matrix 1
      
          //const int Nrows2 = 15;            // --- Number of rows of matrix 2
          //const int Ncols2 = 15;            // --- Number of columns of matrix 2
      
          //const int Nrows3 = 12;            // --- Number of rows of matrix 3
          //const int Ncols3 = 12;            // --- Number of columns of matrix 3
      
          const int Nrows1 = 10;          // --- Number of rows of matrix 1
          const int Ncols1 = 9;           // --- Number of columns of matrix 1
      
          const int Nrows2 = 15;          // --- Number of rows of matrix 2
          const int Ncols2 = 13;          // --- Number of columns of matrix 2
      
          const int Nrows3 = 10;          // --- Number of rows of matrix 3
          const int Ncols3 = 12;          // --- Number of columns of matrix 3
      
          const int Nrows = 5;            // --- Number of rows of submatrix 3 = number of rows of submatrix 1
          const int Ncols = 3;            // --- Number of columns of submatrix 3 = number of columns of submatrix 2
      
          const int Nrowscols = 4;        // --- Number of columns of submatrix 1 = number of rows of submatrix 2
      
          const int x1 = 3;               // --- Row offset of submatrix 1 within matrix 1
          const int y1 = 2;               // --- Column offset of submatrix 1 within matrix 1
      
          const int x2 = 6;               // --- Row offset of submatrix 2 within matrix 2
          const int y2 = 4;               // --- Column offset of submatrix 2 within matrix 2
      
          const int x3 = 3;               // --- Row offset of submatrix 3 within matrix 3
          const int y3 = 5;               // --- Column offset of submatrix 3 within matrix 3
      
          // --- Random uniform integer distribution between 0 and 20
          thrust::default_random_engine rng;
          thrust::uniform_int_distribution<int> dist(0, 20);
      
          // --- Matrix allocation and initialization
          thrust::device_vector<float> d_matrix1(Nrows1 * Ncols1);
          thrust::device_vector<float> d_matrix2(Nrows2 * Ncols2);
          for (size_t i = 0; i < d_matrix1.size(); i++) d_matrix1[i] = (float)dist(rng);
          for (size_t i = 0; i < d_matrix2.size(); i++) d_matrix2[i] = (float)dist(rng);
      
          printf("\n\nOriginal full size matrix A\n");
          for(int i = 0; i < Nrows1; i++) {
              std::cout << "[ ";
              for(int j = 0; j < Ncols1; j++) 
                  std::cout << d_matrix1[j * Nrows1 + i] << " ";
              std::cout << "]\n";
          }
      
          printf("\n\nOriginal full size matrix B\n");
          for(int i = 0; i < Nrows2; i++) {
              std::cout << "[ ";
              for(int j = 0; j < Ncols2; j++) 
                  std::cout << d_matrix2[j * Nrows2 + i] << " ";
              std::cout << "]\n";
          }
      
          /*************************/
          /* MATRIX MULTIPLICATION */
          /*************************/
          cublasHandle_t handle;
      
          cublasSafeCall(cublasCreate(&handle));
      
          thrust::device_vector<float> d_matrix3(Nrows3 * Ncols3, 10.f);
      
          float alpha = 1.f;
          float beta  = 0.f;
          cublasSafeCall(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, Nrows, Ncols, Nrowscols, &alpha,
                         thrust::raw_pointer_cast(d_matrix1.data())+x1+Nrows1*y1, Nrows1, thrust::raw_pointer_cast(d_matrix2.data())+x2+Nrows2*y2, Nrows2,
                         &beta, thrust::raw_pointer_cast(d_matrix3.data())+x3+Nrows3*y3, Nrows3));
      
          printf("\n\nResult full size matrix C\n");
          for(int i = 0; i < Nrows3; i++) {
              std::cout << "[ ";
              for(int j = 0; j < Ncols3; j++) 
                  std::cout << d_matrix3[j * Nrows3 + i] << " ";
              std::cout << "]\n";
          }
      
          // --- Release the cuBLAS context before exiting
          cublasSafeCall(cublasDestroy(handle));
      
          return 0;
      }
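Relating this back to the 3 x 3 partition in the question, the block pointers can be recomputed at every loop step from the current partition boundaries, with no reallocation and no copies. The following is a host-only sketch under stated assumptions: `BlockView` and `partition3x3` are hypothetical names, the matrix is column-major with M rows, and each boundary array holds the four cut positions that delimit the three bands.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// View of one block inside a larger column-major matrix (hypothetical type).
struct BlockView {
    std::size_t offset;  // element offset of the block's top-left corner
    int rows;            // number of rows in the block
    int cols;            // number of columns in the block
};

// Build the nine block views of a column-major matrix with M rows.
// rowCuts = {0, r1, r2, M} and colCuts = {0, c1, c2, N} are the band
// boundaries. Block (bi, bj) is stored at index bi + 3 * bj.
std::array<BlockView, 9> partition3x3(const std::array<int, 4> &rowCuts,
                                      const std::array<int, 4> &colCuts,
                                      int M)
{
    assert(rowCuts[0] == 0 && rowCuts[3] == M);
    assert(colCuts[0] == 0);
    std::array<BlockView, 9> blocks{};
    for (int bj = 0; bj < 3; ++bj) {
        for (int bi = 0; bi < 3; ++bi) {
            assert(rowCuts[bi] <= rowCuts[bi + 1]);
            assert(colCuts[bj] <= colCuts[bj + 1]);
            BlockView &b = blocks[bi + 3 * bj];
            b.offset = static_cast<std::size_t>(rowCuts[bi])
                     + static_cast<std::size_t>(M) * static_cast<std::size_t>(colCuts[bj]);
            b.rows = rowCuts[bi + 1] - rowCuts[bi];
            b.cols = colCuts[bj + 1] - colCuts[bj];
        }
    }
    return blocks;
}
```

Each block would then be passed to cuBLAS as the device base pointer plus `offset`, with the block's `rows` and `cols` as dimensions and M as the leading dimension. When the partition changes, only the cut positions need to be updated.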