Question

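First of all, my original question was worded badly; I think it is best to use the example from NVIDIA's CUDA C Programming Guide.

In Section 3.2.3 (Shared Memory), the guide gives the following code for matrix multiplication using shared memory. I hope it is okay to reproduce it here.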

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {

        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);

        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}

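On line 7, Matrix Csub = GetSubMatrix(C, blockRow, blockCol): does every thread execute that statement? Doesn't that defeat the whole point of using shared memory to reduce the number of global memory accesses? I have the impression that I am missing something fundamental here.

Also, there is surely a better way to phrase this question. I just don't know how!

Thanks,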

Zakiir

Answer 1

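Every thread executes the same instruction at the same time (or sits idle), so yes, every thread enters GetSubMatrix. Each thread handles a few of the items: if there are N threads and 3N items to copy, each thread copies 3.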
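To see why every thread calling GetSubMatrix need not cost any global memory traffic, here is a minimal sketch of the helper, modeled on the version in the programming guide as I recall it. The value of BLOCK_SIZE, the `__host__` qualifier and the `#ifndef __CUDACC__` guard are my assumptions, added only so the same code can be checked with an ordinary C++ compiler.

```cpp
// When built by a plain C++ compiler the CUDA qualifiers expand to nothing,
// so the helpers below can be exercised on the host (assumption for testing).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

#define BLOCK_SIZE 16  // assumed tile width; the kernel only needs it to be fixed

// Row-major matrix descriptor: M(row, col) = elements[row * stride + col]
struct Matrix {
    int width;
    int height;
    int stride;
    float* elements;
};

// Build a descriptor for the BLOCK_SIZE x BLOCK_SIZE tile located
// `row` tiles down and `col` tiles across from the top-left corner of A.
__host__ __device__ inline Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width  = BLOCK_SIZE;
    Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;
    // Address arithmetic only: nothing is read through A.elements here.
    Asub.elements = A.elements + A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col;
    return Asub;
}

// This is where the actual memory read happens: one element per call.
__host__ __device__ inline float GetElement(Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}
```

Under these assumptions GetSubMatrix only fills in four fields from values the thread already holds (the Matrix arguments are passed to the kernel by value), so each thread ends up with its own small descriptor. The global loads are the GetElement calls, and in the kernel each thread issues just one of those per tile of A and one per tile of B.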

As an example of threads sharing out a copy, if I were copying a vector I might do something like this:

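// ??? stands for values supplied by the surrounding kernel
float *from = ???;
float *to   = ???;
int   num   = ???;
int   thread = threadIdx.x + threadIdx.y*blockDim.x ...; // A linear index
int   num_threads = blockDim.x * blockDim.y * blockDim.z;
// Start at this thread's linear index, then step by the total thread count
for(int i=thread; i < num; i+= num_threads) {
     to[i] = from[i];
}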

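Each thread copies one element at a time. As an aside: if you can arrange for all the threads to copy a run of consecutive elements, the copy gets a speed bonus.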
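The copy pattern above can be sketched in a form that compiles. This is only an illustration under assumptions of mine: CopyStrided and CopyKernel are hypothetical names, the kernel is meant for a single thread block, and the z term of the linear index is my guess at what the elided part of the snippet stands for. The per-thread loop sits in a helper so it can also be run on the host.

```cpp
// When built by a plain C++ compiler the CUDA qualifiers expand to nothing
// (assumption for testing); nvcc sees the real qualifiers and the kernel.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Work done by ONE thread (hypothetical helper): copy elements
// thread, thread + num_threads, thread + 2 * num_threads, ...
__host__ __device__ inline void CopyStrided(const float* from, float* to, int num,
                                            int thread, int num_threads)
{
    for (int i = thread; i < num; i += num_threads) {
        to[i] = from[i];
    }
}

#ifdef __CUDACC__
// Hypothetical kernel wrapper, intended for a launch with a single block.
__global__ void CopyKernel(const float* from, float* to, int num)
{
    // Linear index of this thread within its block (z term is an assumption)
    int thread = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
    int num_threads = blockDim.x * blockDim.y * blockDim.z;
    CopyStrided(from, to, num, thread, num_threads);
}
#endif
```

On each pass of the loop, thread 0 touches element i, thread 1 touches element i + 1, and so on, so neighbouring threads read and write neighbouring addresses. That is the access pattern the hardware can combine into fewer memory transactions, which is where the speed bonus comes from. Giving each thread its own contiguous chunk instead would spread simultaneous accesses far apart.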
