我是CUDA的新手。我有一个内核来做矩阵乘法。对我来说似乎没问题,但在某些情况下失败了。请帮我解决问题所在。
__global__ void matrixMultiply(float * A, float * B, float * C,
int numARows, int numAColumns,
int numBRows, int numBColumns,
int numCRows, int numCColumns)
{
//@@ Insert code to implement matrix multiplication here
int Row = blockIdx.y * blockDim.y + threadIdx.y;
int Col = blockIdx.x * blockDim.x + threadIdx.x;
if (numAColumns != numBRows) return;
if ((Row < numARows) && (Col < numBColumns)){
float Cvalue = 0;
for (int k = 0 ; k < numAColumns ; ++k )
Cvalue += A[Row*numAColumns + k] * B[k * numBColumns + Col];
C[Row*numCColumns + Col] = Cvalue;
__syncthreads();
}
}
我按如下方式调用内核。
int BLOCKX = (int)(ceil((numCRows / 8.0)));
int BLOCKY = (int)(ceil((numCColumns / 8.0)));
printf("Number of blocks: %d\t%d\n", BLOCKX, BLOCKY);
dim3 DimGrid(BLOCKX, BLOCKY);
dim3 DimBlock(8 , 8, 1);
答案 0 :(得分:1)
您的代码将在下面发生死锁:
if ((Row < numARows) && (Col < numBColumns)){
float Cvalue = 0;
for (int k = 0 ; k < numAColumns ; ++k )
Cvalue += A[Row*numAColumns + k] * B[k * numBColumns + Col];
C[Row*numCColumns + Col] = Cvalue;
__syncthreads();
}
考虑一个块,对于某些线程,满足条件,而对某些线程则不满足。在这种情况下,这将陷入僵局。将__syncthreads()
置于if
条件
同时将dim3 DimGrid(BLOCKX, BLOCKY);
替换为dim3 DimGrid(BLOCKY, BLOCKX);
。那应该解决它