Returning a pointer to a device-allocated matrix from C to Fortran

Date: 2014-03-13 21:00:36

Tags: c cuda fortran intel-fortran

First off, I'm somewhat new to Fortran/C/CUDA. Second, I'm working on a Fortran/C program that uses cuBLAS to perform matrix-vector multiplication on the GPU. I need to multiply several (up to 1000) vectors by one matrix before the matrix contents need to be updated. However, every time a new vector is sent to the GPU, I currently have to re-allocate the current version of the matrix, which is very wasteful and slow since the matrix hasn't changed.

I'd like to be able to multiply the matrix by the vectors without re-allocating the matrix for every vector. The idea I came up with is to call a separate C function that allocates the matrix on the GPU and returns a pointer to the Fortran main program, and then to call another C function that performs the matrix-vector multiplication.

Using ISO_C_BINDING, I receive the returned pointer-to-float in a variable:

type(C_PTR) :: ptr

Then I try to pass it to the matrix-vector C function:

In Fortran:

call cudaFunction(ptr, vector, N)

In C:

extern "C" void cudaFunction_(float *mat, float *vector, int *N)

Everything compiles and runs, but the cublasSgemv call fails to execute. Any ideas why this might be happening? I've seen some related posts, but none of them ever tried to send the returned pointer back to C, which is where (I believe) my problem lies.
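
For reference, the usual ISO_C_BINDING pattern for this kind of call is an explicit interface that passes the C_PTR by value. The sketch below is illustrative only, not the code from the question: the allocator name is hypothetical, and with bind(C) the matching C-side symbols would carry no trailing underscore (unlike cudaFunction_ above).

c     illustrative sketch: explicit ISO_C_BINDING interfaces so the
c     device pointer travels by value; "allocmatrixongpu" is a
c     hypothetical name, and bind(C) means the matching C symbols
c     are "allocmatrixongpu" and "cudafunction" (no underscore)
      interface
         function allocmatrixongpu(mat, m, n) bind(C)
            use iso_c_binding
            type(C_PTR) :: allocmatrixongpu
            real(C_FLOAT) :: mat(*)
            integer(C_INT), value :: m, n
         end function
         subroutine cudafunction(dmat, vector, n) bind(C)
            use iso_c_binding
            type(C_PTR), value :: dmat
            real(C_FLOAT) :: vector(*)
            integer(C_INT), value :: n
         end subroutine
      end interface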

Thanks in advance!

1 answer:

Answer 0 (score: 2):

I would suggest that rather than re-inventing the wheel, you use the cublas fortran bindings that are provided for this purpose.

" thunking"包装器不是你想要的。只要你在fortran中使用cublas调用,它就会根据需要进行隐式复制操作。

You want the "non-thunking" wrappers, so you have explicit control over the copying that takes place. You can use the fortran equivalents of Get/SetMatrix and Get/SetVector to copy data back and forth.

There is a sample code (example B.2) included in the cublas documentation that shows how to use the non-thunking wrappers.

Even if you want to re-invent the wheel, the wrappers will show you the necessary syntax for moving data between C and Fortran.

On a standard Linux CUDA install, the wrappers are located in /usr/local/cuda/src; the non-thunking wrapper is /usr/local/cuda/src/fortran.c.
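
Applied to the use case in the question (one matrix, many vectors), the non-thunking flow might look roughly like the sketch below. This is a sketch only: it assumes single precision, omits the error checking shown in the full example, and relies on the cublas_set_matrix and cublas_sgemv wrappers from fortran.c, compiled the same way as the full example that follows.

      program gemv_reuse_sketch
      implicit none
      integer m, n, nvec
      parameter (m=4, n=4, nvec=3)
      integer i, j, k, stat
c     non-thunking wrapper helpers from fortran.c
      integer cublas_init, cublas_shutdown
      integer cublas_alloc, cublas_free
      integer cublas_set_matrix
      integer cublas_set_vector, cublas_get_vector
      external cublas_sgemv
      real*4 a(m,n), x(n,nvec), y(m,nvec)
      integer*8 d_a, d_x, d_y

      stat = cublas_init()
c     fill a with ones, each x(:,k) with the value k
      do j=1,n
         do i=1,m
            a(i,j) = 1.0
         enddo
      enddo
      do k=1,nvec
         do j=1,n
            x(j,k) = real(k)
         enddo
      enddo
c     allocate device storage and upload the matrix ONCE
      stat = cublas_alloc(m*n, 4, d_a)
      stat = cublas_set_matrix(m, n, 4, a, m, d_a, m)
      stat = cublas_alloc(n, 4, d_x)
      stat = cublas_alloc(m, 4, d_y)
c     reuse the device copy of the matrix for every vector
      do k=1,nvec
         stat = cublas_set_vector(n, 4, x(1,k), 1, d_x, 1)
         call cublas_sgemv('N', m, n, 1.0, d_a, m, d_x, 1,
     >        0.0, d_y, 1)
         stat = cublas_get_vector(m, 4, d_y, 1, y(1,k), 1)
         write(*,*) "y(1,", k, ") =", y(1,k)
      enddo
      stat = cublas_free(d_a)
      stat = cublas_free(d_x)
      stat = cublas_free(d_y)
      stat = cublas_shutdown()
      stop
      end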

Here's a fully worked example:

cublasf.f:

      program cublas_fortran_example
      implicit none
      integer i, j

c     helper functions
      integer cublas_init
      integer cublas_shutdown
      integer cublas_alloc
      integer cublas_free
      integer cublas_set_vector
      integer cublas_get_vector
c     selected blas functions
      double precision cublas_ddot
      external cublas_daxpy
      external cublas_dscal
      external cublas_dcopy
      double precision cublas_dnrm2
c     cublas variables
      integer cublas_status
      real*8 x(30), y(30)
      double precision alpha, beta
      double precision nrm
      integer*8 d_x, d_y, d_alpha, d_beta, d_nrm
      integer*8 dsize1, dlength1, dlength2
      double precision dresult



      write(*,*) "testing cublas fortran example"

c     initialize cublas library
c     CUBLAS_STATUS_SUCCESS=0
      cublas_status = cublas_init()
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS Library initialization failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
c     initialize data
      do j=1,30
        x(j) = 1.0
        y(j) = 2.0
      enddo
      dsize1 = 8
      dlength1 = 30
      dlength2 = 1
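c     (dlength2 = 1 doubles as the scalar allocation length and as
c      the incx/incy increment argument in the calls below)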
      alpha = 2.0
      beta = 3.0
c     allocate device storage
      cublas_status = cublas_alloc(dlength1, dsize1, d_x)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif
      cublas_status = cublas_alloc(dlength1, dsize1, d_y)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_alpha)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_beta)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_nrm)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif
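c     (note: d_alpha, d_beta and d_nrm are allocated and freed but
c      never otherwise used; the daxpy/dscal wrappers below take
c      alpha and beta directly from the host)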

c     copy data from host to device

      cublas_status = cublas_set_vector(dlength1, dsize1, x, dlength2,
     >     d_x, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to device failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
      cublas_status = cublas_set_vector(dlength1, dsize1, y, dlength2,
     >     d_y, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to device failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif

      dresult = cublas_ddot(dlength1, d_x, dlength2, d_y, dlength2)
      write(*,*) "dot product result=",dresult

      dresult = cublas_dnrm2(dlength1, d_x, dlength2)
      write(*,*) "nrm2 of x result=",dresult

      dresult = cublas_dnrm2(dlength1, d_y, dlength2)
      write(*,*) "nrm2 of y result=",dresult

      call cublas_daxpy(dlength1, alpha, d_x, dlength2, d_y, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_y, dlength2,
     >     y, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to host failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
      write(*,*) "daxpy y(1)  =", y(1)
      write(*,*) "daxpy y(30) =", y(30)


      call cublas_dscal(dlength1, beta, d_x, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_x, dlength2,
     >     x, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to host failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
      write(*,*) "dscal x(1)  =", x(1)
      write(*,*) "dscal x(30) =", x(30)


      call cublas_dcopy(dlength1, d_x, dlength2, d_y, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_y, dlength2,
     >     y, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to host failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
      write(*,*) "dcopy y(1)  =", y(1)
      write(*,*) "dcopy y(30) =", y(30)

c     deallocate GPU memory and exit
      cublas_status = cublas_free(d_x)
      cublas_status = cublas_free(d_y)
      cublas_status = cublas_free(d_alpha)
      cublas_status = cublas_free(d_beta)
      cublas_status = cublas_free(d_nrm)
      cublas_status = cublas_shutdown()
      stop
      end

Compile/run:

$ gfortran -c -o cublasf.o cublasf.f
$ gcc -c -DCUBLAS_GFORTRAN -I/usr/local/cuda/include -I/usr/local/cuda/src -o fortran.o /usr/local/cuda/src/fortran.c
$ gfortran -L/usr/local/cuda/lib64 -lcublas -o cublasf cublasf.o fortran.o
$ ./cublasf
 testing cublas fortran example
 dot product result=   60.0000000000000
 nrm2 of x result=   5.47722557505166
 nrm2 of y result=   10.9544511501033
 daxpy y(1)  =   4.00000000000000
 daxpy y(30) =   4.00000000000000
 dscal x(1)  =   3.00000000000000
 dscal x(30) =   3.00000000000000
 dcopy y(1)  =   3.00000000000000
 dcopy y(30) =   3.00000000000000
$ 

CUDA 5.0, RHEL 5.5