我是OpenCL的新手,尤其是clBlas。我想实现一个简单的程序,它将矩阵A作为输入,将其与另一个矩阵B相乘,得到矩阵C,然后将C乘以另一个矩阵D,并返回这些运算的结果矩阵。
我想部署clBlas库,试图保持最高性能。我该怎么做?
到目前为止,我已经看到进行单矩阵乘法的常用方法如下:
/* Setup clBLAS */
err = clblasSetup( );
/* Prepare OpenCL memory objects and place matrices inside them. */
bufA = clCreateBuffer( ctx, CL_MEM_READ_ONLY, M * K * sizeof(*A),
NULL, &err );
bufB = clCreateBuffer( ctx, CL_MEM_READ_ONLY, K * N * sizeof(*B),
NULL, &err );
bufC = clCreateBuffer( ctx, CL_MEM_READ_WRITE, M * N * sizeof(*C),
NULL, &err );
err = clEnqueueWriteBuffer( queue, bufA, CL_TRUE, 0,
M * K * sizeof( *A ), A, 0, NULL, NULL );
err = clEnqueueWriteBuffer( queue, bufB, CL_TRUE, 0,
K * N * sizeof( *B ), B, 0, NULL, NULL );
err = clEnqueueWriteBuffer( queue, bufC, CL_TRUE, 0,
M * N * sizeof( *C ), C, 0, NULL, NULL );
/* Call clBLAS extended function. Perform gemm for the lower right sub-matrices */
err = clblasSgemm( clblasRowMajor, clblasNoTrans, clblasNoTrans,
M, N, K,
alpha, bufA, 0, lda,
bufB, 0, ldb, beta,
bufC, 0, ldc,
1, &queue, 0, NULL, &event );
/* Wait for calculations to be finished. */
err = clWaitForEvents( 1, &event );
/* Fetch results of calculations from GPU memory. */
err = clEnqueueReadBuffer( queue, bufC, CL_TRUE, 0,
M * N * sizeof(*result),
result, 0, NULL, NULL );
但是通过这种方式我们将矩阵从主机内存移动到GPU内存这是一种浪费,有没有办法将结果保存在GPU内存中?