First of all, I'm a bit new to Fortran/C/CUDA. Second, I'm working on a Fortran/C program that performs matrix-vector multiplication on the GPU using cuBLAS. I need to multiply many (up to 1000) vectors by a single matrix before the contents of that matrix need to be updated. However, every time a new vector is sent to the GPU I currently have to re-allocate and re-copy the current version of the matrix, which is very wasteful and slow given that the matrix has not changed.
I would like to be able to multiply the matrix by the vectors without re-allocating the matrix for each vector. The idea I have been toying with is to call a separate C function that allocates the matrix on the GPU and returns a pointer to the Fortran main program, and then to call another C function that performs the matrix-vector multiplication.
Using ISO_C_BINDING, I receive the returned pointer to the floats in a variable:
type(C_PTR) :: ptr
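The allocation step is set up roughly like this (a simplified sketch rather than my exact code; sendMatrixToGPU is a placeholder name for the C routine that allocates the matrix on the device and returns the device pointer):

! Simplified sketch of the allocation step; sendMatrixToGPU is a
! placeholder name, and the real code may differ.
use iso_c_binding
interface
   function sendMatrixToGPU(mat, N) bind(C, name="sendMatrixToGPU")
      import :: c_ptr, c_float, c_int
      real(c_float), intent(in)  :: mat(*)   ! host matrix data
      integer(c_int), intent(in) :: N        ! matrix dimension
      type(c_ptr)                :: sendMatrixToGPU
   end function sendMatrixToGPU
end interface
! ... later:
ptr = sendMatrixToGPU(mat, N)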
When I try to pass it to the matrix-vector C function:
in Fortran:
call cudaFunction(ptr, vector, N)
in C:
extern "C" void cudaFunction_(float *mat, float *vector, int *N)
Everything compiles and runs, but the call to cublasSgemv fails to execute. Any ideas as to why this might be happening? I have seen some related posts, but none of them ever try to send the returned pointer back into C, which is (I believe) where I'm running into trouble.
Thanks in advance!
Answer 0 (score: 2)
Rather than reinvent the wheel, I would suggest using the cublas fortran bindings that are provided for exactly this purpose.
The "thunking" wrappers are not what you want: they perform implicit copy operations as needed on every cublas call made from fortran.
You want the "non-thunking" wrappers, so that you have explicit control over the copying that takes place. You then use the fortran equivalents of Get/SetMatrix and Get/SetVector to copy data back and forth, which lets you copy the matrix to the device once and reuse it for every vector.
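For your matrix-vector case the flow would look roughly like the sketch below. This is only an outline: the wrapper names (cublas_alloc, cublas_set_matrix, cublas_set_vector, cublas_get_vector, cublas_sgemv) are the ones I expect to find in fortran.c, so verify them against your copy of that file, and error checking is omitted. The key point is that the matrix is copied to the device once and then reused for every vector.

c     Sketch: copy the matrix to the GPU once, reuse it for many vectors.
c     Wrapper names assumed from fortran.c; verify locally.
      program sgemv_reuse
      implicit none
      integer m, n, nvec, j, k, stat
      parameter (m=4, n=3, nvec=5)
      integer cublas_init, cublas_shutdown, cublas_alloc, cublas_free
      integer cublas_set_matrix, cublas_set_vector, cublas_get_vector
      external cublas_sgemv
      real*4 A(m,n), x(n), y(m), alpha, beta
      integer*8 d_A, d_x, d_y
      stat = cublas_init()
      do j=1,n
        do k=1,m
          A(k,j) = 1.0
        enddo
        x(j) = 1.0
      enddo
      alpha = 1.0
      beta  = 0.0
      stat = cublas_alloc(m*n, 4, d_A)
      stat = cublas_alloc(n, 4, d_x)
      stat = cublas_alloc(m, 4, d_y)
c     the matrix goes to the device exactly once
      stat = cublas_set_matrix(m, n, 4, A, m, d_A, m)
      do k=1,nvec
c       only the vector is (re)copied on each iteration
        stat = cublas_set_vector(n, 4, x, 1, d_x, 1)
        call cublas_sgemv('N', m, n, alpha, d_A, m, d_x, 1,
     >                    beta, d_y, 1)
        stat = cublas_get_vector(m, 4, d_y, 1, y, 1)
        write(*,*) "y(1) =", y(1)
      enddo
      stat = cublas_free(d_A)
      stat = cublas_free(d_x)
      stat = cublas_free(d_y)
      stat = cublas_shutdown()
      end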
There is a sample code (example B.2) included in the cublas documentation that shows how to use the non-thunking wrappers.
Even if you do want to reinvent the wheel, the wrappers will show you the necessary syntax for moving the data between C and Fortran.
On a standard Linux CUDA install, the wrappers live in /usr/local/cuda/src; the non-thunking wrapper is /usr/local/cuda/src/fortran.c.
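To give you an idea of what that file looks like inside, the shims follow this general shape. This is an illustrative reconstruction, not a verbatim copy of fortran.c: the devptr_t typedef, the trailing-underscore name, and the exact argument handling are assumptions on my part, and the real file deals with name mangling via macros (hence the -DCUBLAS_GFORTRAN flag in the compile commands below).

/* Illustrative reconstruction of the kind of shim found in fortran.c
 * (not a verbatim copy): Fortran passes everything by reference, the
 * device pointer travels as an integer*8, and the shim forwards the
 * dereferenced values to the legacy cublas API. */
#include <stddef.h>
#include <stdint.h>
#include <cublas.h>

typedef size_t devptr_t;   /* assumption: 64-bit handle matching integer*8 */

void cublas_sgemv_(const char *trans, const int *m, const int *n,
                   const float *alpha, const devptr_t *devPtrA,
                   const int *lda, const devptr_t *devPtrx,
                   const int *incx, const float *beta,
                   const devptr_t *devPtry, const int *incy)
{
    /* reinterpret the integer handles as device pointers */
    float *A = (float *)(uintptr_t)(*devPtrA);
    float *x = (float *)(uintptr_t)(*devPtrx);
    float *y = (float *)(uintptr_t)(*devPtry);
    cublasSgemv(*trans, *m, *n, *alpha, A, *lda, x, *incx,
                *beta, y, *incy);
}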
Here is a fully worked example:
cublasf.f:
      program cublas_fortran_example
      implicit none
      integer i, j
c helper functions
      integer cublas_init
      integer cublas_shutdown
      integer cublas_alloc
      integer cublas_free
      integer cublas_set_vector
      integer cublas_get_vector
c selected blas functions
      double precision cublas_ddot
      external cublas_daxpy
      external cublas_dscal
      external cublas_dcopy
      double precision cublas_dnrm2
c cublas variables
      integer cublas_status
      real*8 x(30), y(30)
      double precision alpha, beta
      double precision nrm
      integer*8 d_x, d_y, d_alpha, d_beta, d_nrm
      integer*8 dsize1, dlength1, dlength2
      double precision dresult
      write(*,*) "testing cublas fortran example"
c initialize cublas library
c CUBLAS_STATUS_SUCCESS=0
      cublas_status = cublas_init()
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS Library initialization failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
c initialize data
      do j=1,30
        x(j) = 1.0
        y(j) = 2.0
      enddo
      dsize1 = 8
      dlength1 = 30
      dlength2 = 1
      alpha = 2.0
      beta = 3.0
c allocate device storage
      cublas_status = cublas_alloc(dlength1, dsize1, d_x)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
      cublas_status = cublas_alloc(dlength1, dsize1, d_y)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_alpha)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_beta)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_nrm)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
c copy data from host to device
      cublas_status = cublas_set_vector(dlength1, dsize1, x, dlength2,
     >                                  d_x, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to device failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      cublas_status = cublas_set_vector(dlength1, dsize1, y, dlength2,
     >                                  d_y, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to device failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      dresult = cublas_ddot(dlength1, d_x, dlength2, d_y, dlength2)
      write(*,*) "dot product result=",dresult
      dresult = cublas_dnrm2(dlength1, d_x, dlength2)
      write(*,*) "nrm2 of x result=",dresult
      dresult = cublas_dnrm2(dlength1, d_y, dlength2)
      write(*,*) "nrm2 of y result=",dresult
      call cublas_daxpy(dlength1, alpha, d_x, dlength2, d_y, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_y, dlength2,
     >                                  y, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to host failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      write(*,*) "daxpy y(1) =", y(1)
      write(*,*) "daxpy y(30) =", y(30)
      call cublas_dscal(dlength1, beta, d_x, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_x, dlength2,
     >                                  x, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to host failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      write(*,*) "dscal x(1) =", x(1)
      write(*,*) "dscal x(30) =", x(30)
      call cublas_dcopy(dlength1, d_x, dlength2, d_y, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_y, dlength2,
     >                                  y, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to host failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      write(*,*) "dcopy y(1) =", y(1)
      write(*,*) "dcopy y(30) =", y(30)
c deallocate GPU memory and exit
      cublas_status = cublas_free(d_x)
      cublas_status = cublas_free(d_y)
      cublas_status = cublas_free(d_alpha)
      cublas_status = cublas_free(d_beta)
      cublas_status = cublas_free(d_nrm)
      cublas_status = cublas_shutdown()
      stop
      end
Compile/run:
$ gfortran -c -o cublasf.o cublasf.f
$ gcc -c -DCUBLAS_GFORTRAN -I/usr/local/cuda/include -I/usr/local/cuda/src -o fortran.o /usr/local/cuda/src/fortran.c
$ gfortran -L/usr/local/cuda/lib64 -lcublas -o cublasf cublasf.o fortran.o
$ ./cublasf
testing cublas fortran example
dot product result= 60.0000000000000
nrm2 of x result= 5.47722557505166
nrm2 of y result= 10.9544511501033
daxpy y(1) = 4.00000000000000
daxpy y(30) = 4.00000000000000
dscal x(1) = 3.00000000000000
dscal x(30) = 3.00000000000000
dcopy y(1) = 3.00000000000000
dcopy y(30) = 3.00000000000000
$
CUDA 5.0, RHEL 5.5