I am trying to write a simple matrix multiplication application that multiplies two square matrices using CUDA. The problem I am running into is that my kernel only computes correct results in block (0,0) of the grid.
Here is my invocation code:
dim3 dimBlock(4, 4, 1);
dim3 dimGrid(4, 4, 1);
// Launch the kernel
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
And here is my kernel function:
__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int bx = blockIdx.x;
    const int by = blockIdx.y;
    const int row = by * blockDim.y + ty;
    const int col = bx * blockDim.x + tx;

    // Pvalue stores the Pd element that is computed by the thread
    int Pvalue = 0;
    for (int k = 0; k < Width; k++)
    {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }
    __syncthreads();

    // Write the result to device memory; each thread writes one element
    Pd[row * Width + col] = Pvalue;
}
I think the problem might be memory-related, but I am a bit lost. What should I do to make this code work across multiple blocks?
Answer 0 (score: 1)
The problem was in my CUDA kernel invocation. The grid was too small for the matrix being processed: a 4×4 grid of 4×4 blocks only covers a 16×16 tile, so for a Width×Width matrix the grid must have ceil(Width/4) blocks along each dimension.