Question

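I found some code for a CUDA matrix-vector product in an earlier thread: Matrix-vector multiplication in CUDA: benchmarking & performance. First, I would like to know why the author did not use shared memory for dA (the matrix).

Second, why is column-major ordering faster than row-major ordering?

Here is the code: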

    template<typename T>
    __global__ void matvec_kernel(const T * __restrict__ dA, const T * __restrict__ dx, T * __restrict__ dy, const unsigned int nRows, const unsigned int nCols)
    {
        const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

        __shared__ T x_shared[BLOCK_SIZE];

        T y_val = 0.0;

        #pragma unroll
        for (unsigned int m = 0; m < ((nCols + BLOCK_SIZE - 1)/ BLOCK_SIZE); ++m)
        {
            if ((m * BLOCK_SIZE + threadIdx.x) <  nCols) x_shared[threadIdx.x] = dx[threadIdx.x + m * BLOCK_SIZE];
            else                                         x_shared[threadIdx.x] = 0.f;
            __syncthreads();

            #pragma unroll
            for (unsigned int e = 0; e < BLOCK_SIZE; ++e) {
                // --- Column-major ordering - faster
                y_val += dA[tid + (e + BLOCK_SIZE * m) * nRows] * x_shared[e];
                // --- Row-major ordering - slower
                //y_val += dA[tid * nCols + (e + BLOCK_SIZE * m)] * x_shared[e];
            }

            __syncthreads();
        }

        if (tid < nRows) dy[tid] = y_val;

    }

I have been puzzling over these two questions, which is why I am asking here.

Thanks a lot!

Answer 1

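Shared memory is used here as a cache. Each component of the vector is read many times (every thread in a block needs every element of x), whereas each component of the matrix is read only once during the whole computation, by the single thread that owns its row. Caching pays off only for data that is reused, which is why the code caches the vector and not the matrix.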
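To make the reuse argument concrete, here is a minimal sketch of the same product with no shared memory at all. It is not the original author's code: the names `matvec_naive_kernel` and `matvec_naive` are hypothetical, the matrix is assumed to be column-major, and error checking is left out for brevity. The comments mark which loads are repeated across threads and which are not.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical baseline for comparison: one thread per row, no shared memory.
// A is assumed column-major: element (row, col) lives at dA[row + col * nRows].
template <typename T>
__global__ void matvec_naive_kernel(const T* __restrict__ dA,
                                    const T* __restrict__ dx,
                                    T* __restrict__ dy,
                                    unsigned int nRows, unsigned int nCols)
{
    const unsigned int row = threadIdx.x + blockIdx.x * blockDim.x;
    if (row >= nRows) return;  // safe: this kernel has no __syncthreads()

    T y_val = 0;
    for (unsigned int col = 0; col < nCols; ++col) {
        // dA[row + col * nRows] is loaded by this thread only, exactly once:
        //   staging it in shared memory would not save any global load.
        // dx[col] is requested by every thread that owns a row:
        //   this is the reuse that the x_shared tile exploits.
        y_val += dA[row + col * nRows] * dx[col];
    }
    dy[row] = y_val;
}

// Hypothetical host wrapper; error checking omitted for brevity.
std::vector<float> matvec_naive(const std::vector<float>& A,
                                const std::vector<float>& x,
                                unsigned int nRows, unsigned int nCols)
{
    float *dA = nullptr, *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dx, x.size() * sizeof(float));
    cudaMalloc(&dy, nRows * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);

    const unsigned int threads = 128;
    const unsigned int blocks = (nRows + threads - 1) / threads;
    matvec_naive_kernel<float><<<blocks, threads>>>(dA, dx, dy, nRows, nCols);

    std::vector<float> y(nRows);
    cudaMemcpy(y.data(), dy, nRows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return y;
}
```

On recent GPUs the hardware caches may absorb much of the repeated traffic to `dx`, so how much the explicit shared-memory tile gains over this baseline depends on the device and should be measured rather than assumed.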

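Column-major ordering is faster because of how the threads are laid out relative to the matrix: consecutive threads handle consecutive rows, so at each step of the inner loop they all read from the same column. With column-major storage those elements sit at consecutive addresses, which gives coalesced global memory access. With row-major storage the same reads are nCols elements apart, so they cannot be coalesced. If the matrix is row-major, the CUDA kernel should be implemented differently to achieve maximum performance.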
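As an illustration of what "implemented differently" could look like for a row-major matrix, one common approach is to assign a whole warp to each row, so that adjacent lanes read adjacent elements of that row, and then combine the per-lane partial sums with a warp shuffle. The sketch below is one possible variant, not the method from the linked thread. The names `matvec_row_major_kernel` and `matvec_row_major` are hypothetical, and it assumes CUDA 9 or later (for `__shfl_down_sync`) and a block size that is a multiple of 32.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical row-major variant: one warp per row.
// A is row-major: element (row, col) lives at dA[row * nCols + col].
// Assumes blockDim.x is a multiple of 32 so that each warp maps to one row.
__global__ void matvec_row_major_kernel(const float* __restrict__ dA,
                                        const float* __restrict__ dx,
                                        float* __restrict__ dy,
                                        unsigned int nRows, unsigned int nCols)
{
    const unsigned int lane = threadIdx.x % 32;
    const unsigned int row  = (threadIdx.x + blockIdx.x * blockDim.x) / 32;
    if (row >= nRows) return;  // uniform across the warp, so the warp exits together

    // Lanes 0..31 read columns lane, lane + 32, ...: at each step the warp
    // touches 32 consecutive addresses of the row, which coalesces.
    float partial = 0.0f;
    for (unsigned int col = lane; col < nCols; col += 32)
        partial += dA[static_cast<size_t>(row) * nCols + col] * dx[col];

    // Tree reduction of the 32 partial sums within the warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        partial += __shfl_down_sync(0xffffffffu, partial, offset);

    if (lane == 0) dy[row] = partial;
}

// Hypothetical host wrapper; error checking omitted for brevity.
std::vector<float> matvec_row_major(const std::vector<float>& A,
                                    const std::vector<float>& x,
                                    unsigned int nRows, unsigned int nCols)
{
    float *dA = nullptr, *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dx, x.size() * sizeof(float));
    cudaMalloc(&dy, nRows * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);

    const unsigned int threads = 128;                 // 4 warps per block
    const unsigned int rowsPerBlock = threads / 32;   // one row per warp
    const unsigned int blocks = (nRows + rowsPerBlock - 1) / rowsPerBlock;
    matvec_row_major_kernel<<<blocks, threads>>>(dA, dx, dy, nRows, nCols);

    std::vector<float> y(nRows);
    cudaMemcpy(y.data(), dy, nRows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return y;
}
```

The trade-off is that this layout uses 32 threads per row instead of one, so it tends to suit matrices with many columns; for very short rows most lanes would sit idle.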
