I want to write a matrix multiplication algorithm based on the CUDA shared-memory example, but one that performs computation and data loading at the same time. My code looks like this:
float As[BLOCK_SIZE][BLOCK_SIZE];
float Bs[BLOCK_SIZE][BLOCK_SIZE];

As[ty][tx] = A[aBegin + wA * ty + tx];
Bs[ty][tx] = B[bBegin + wB * ty + tx];

for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
{
    __shared__ float A2s[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float B2s[BLOCK_SIZE][BLOCK_SIZE];

    A2s[ty][tx] = As[ty][tx];
    B2s[ty][tx] = Bs[ty][tx];

    __syncthreads();

    if (a+1 <= aEnd)
    {
        As[ty][tx] = A[a+1 + wA * ty + tx];
        Bs[ty][tx] = B[b+1 + wB * ty + tx];
    }

    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
    {
        Csub += A2s[ty][k] * B2s[k][tx];
    }

    __syncthreads();
}
But it runs slower than the original solution, because the second data load is executed in sequence with the computation instead of overlapping it. How can I make them run in parallel?
Answer (score: 1)
You should avoid moving the data of A and B into the local arrays As and Bs, i.e.

As[ty][tx] = A[aBegin + wA * ty + tx];
Bs[ty][tx] = B[bBegin + wB * ty + tx];

You can move them directly into the shared-memory arrays A2s and B2s instead, i.e.

A2s[ty][tx] = A[aBegin + wA * ty + tx];
B2s[ty][tx] = B[bBegin + wB * ty + tx];
Moreover, the data loaded by

As[ty][tx] = A[a+1 + wA * ty + tx];
Bs[ty][tx] = B[b+1 + wB * ty + tx];

appears to go unexploited: the prefetched values are never consumed by the computation.
Finally, you should move the declaration of the shared-memory arrays outside the for loop, and your code is also missing the final assignment to the output matrix.

Try something like this:
__global__ void TiledMatrixMultiplicationKernel(float* A, float* B, float* C, int Width)
{
    // Shared-memory tiles, declared once, outside the loop over tiles
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Row and column of the C element this thread computes
    int Row = by * BLOCK_SIZE + ty;
    int Col = bx * BLOCK_SIZE + tx;

    float Csub = 0;

    for (int m = 0; m < Width/BLOCK_SIZE; ++m) {

        // Load one tile of A and one tile of B directly into shared memory
        As[ty][tx] = A[Row*Width + (m*BLOCK_SIZE + tx)];
        Bs[ty][tx] = B[Col + (m*BLOCK_SIZE + ty)*Width];
        __syncthreads();

        // Accumulate the partial dot product for this tile
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Wait until every thread is done before the tiles are overwritten
        __syncthreads();
    }

    // Final assignment to the output matrix
    C[Row*Width+Col] = Csub;
}
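If you still want to overlap the global-memory loads with the computation, as the question asks, one common technique is register prefetching: each iteration fetches the next tile into per-thread registers while the multiply-adds run on the tile already in shared memory. The sketch below is a minimal illustration of that idea built on the kernel above; the kernel name PrefetchedMatMulKernel and the variables aNext/bNext are hypothetical names of mine, and it assumes Width is a multiple of BLOCK_SIZE.

__global__ void PrefetchedMatMulKernel(float* A, float* B, float* C, int Width)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * BLOCK_SIZE + ty;
    int Col = blockIdx.x * BLOCK_SIZE + tx;

    // Preload the first tile into registers
    float aNext = A[Row*Width + tx];
    float bNext = B[Col + ty*Width];

    float Csub = 0;

    for (int m = 0; m < Width/BLOCK_SIZE; ++m) {
        // Stage the previously prefetched tile into shared memory
        As[ty][tx] = aNext;
        Bs[ty][tx] = bNext;
        __syncthreads();

        // Issue the loads for the NEXT tile; they only touch registers,
        // so they can overlap with the multiply-adds below, whose
        // results are not needed until the next iteration
        if (m + 1 < Width/BLOCK_SIZE) {
            aNext = A[Row*Width + ((m+1)*BLOCK_SIZE + tx)];
            bNext = B[Col + ((m+1)*BLOCK_SIZE + ty)*Width];
        }

        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        __syncthreads();
    }

    C[Row*Width+Col] = Csub;
}

Either kernel would be launched with a BLOCK_SIZE x BLOCK_SIZE thread block, e.g. kernel<<<dim3(Width/BLOCK_SIZE, Width/BLOCK_SIZE), dim3(BLOCK_SIZE, BLOCK_SIZE)>>>(dA, dB, dC, Width). Whether prefetching actually helps depends on the architecture and occupancy, since the compiler already reorders independent loads to some extent, so verify any gain with a profiler.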