Question

I am trying to count the DRAM (global memory) accesses made by a simple vector addition kernel.

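__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    // Each block owns a contiguous chunk of blockDim.x * N elements.
    int blockStartIndex  = blockIdx.x * blockDim.x * N;
    int threadStartIndex = blockStartIndex + threadIdx.x;
    int threadEndIndex   = threadStartIndex + ( N * blockDim.x );
    int i;

    // Each thread handles N elements, striding by blockDim.x so that
    // neighbouring threads touch neighbouring addresses (coalesced access).
    for( i=threadStartIndex; i<threadEndIndex; i+=blockDim.x ){
        C[i] = A[i] + B[i];
    }
}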

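Grid size = 180, block size = 128

Size of each array = 180 * 128 * N floats, where N is an input parameter (the number of elements per thread)

When N = 1, the size of each array = 180 * 128 * 1 floats = 90 KB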

All three arrays should go through DRAM: A and B are read from it, and C is written to it.

So in theory:

DRAM writes (C) = 2880 (32-byte accesses)
DRAM reads (A, B) = 2880 + 2880 = 5760 (32-byte accesses)
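The expected counts above follow from dividing each array's size in bytes by the 32-byte sector size. A minimal sketch of that arithmetic in plain host-side C++ (it should also compile inside a .cu file); the function names are hypothetical helpers introduced for illustration, and the code assumes 4-byte floats and that every sector of A, B, and C is transferred exactly once:

```cpp
#include <cstddef>

// Size of one DRAM sector as counted by the fb_subp*_sectors events.
constexpr std::size_t kSectorBytes = 32;

// Number of 32-byte sectors needed to hold `count` floats (rounded up).
constexpr std::size_t sectors_for_floats(std::size_t count) {
    return (count * sizeof(float) + kSectorBytes - 1) / kSectorBytes;
}

// Expected sectors written: the kernel writes the output array C once.
constexpr std::size_t expected_write_sectors(std::size_t grid,
                                             std::size_t block,
                                             std::size_t n) {
    return sectors_for_floats(grid * block * n);
}

// Expected sectors read: the kernel reads both input arrays A and B once.
constexpr std::size_t expected_read_sectors(std::size_t grid,
                                            std::size_t block,
                                            std::size_t n) {
    return 2 * sectors_for_floats(grid * block * n);
}

// Compile-time checks against the figures in the question
// (grid = 180, block = 128, N = 1).
static_assert(expected_write_sectors(180, 128, 1) == 2880, "C: 2880 sectors");
static_assert(expected_read_sectors(180, 128, 1) == 5760, "A+B: 5760 sectors");
```

By the same arithmetic, N = 2 doubles both figures, which is why a read count that stays at 30 cannot reflect the data actually consumed by the kernel.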

But when I profile with nvprof, I get:

DRAM writes = fb_subp0_write_sectors + fb_subp1_write_sectors = 1440 + 1440 = 2880 (32-byte accesses)
DRAM reads = fb_subp0_read_sectors + fb_subp1_read_sectors = 23 + 7 = 30 (32-byte accesses)

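This is the problem. In theory there should be 5760 DRAM reads, but nvprof reports only 30, which looks impossible to me. Moreover, if I double the size of the vectors (N = 2), the reported number of DRAM reads is still 30.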

It would be great if someone could shed some light on this.

I have disabled the L1 cache with the compiler option "-Xptxas -dlcm=cg".

Thanks, Waruna

Answer 1

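If a cudaMemcpy copies the source buffers from host to device before the kernel launch, those buffers end up in the L2 cache. The kernel's reads then hit in L2 instead of missing and going out to DRAM, so the sum fb_subp0_read_sectors + fb_subp1_read_sectors comes out far lower than expected.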

If you comment out the cudaMemcpy calls that precede the kernel launch, you will find that the fb_subp0_read_sectors and fb_subp1_read_sectors events report the values you expect.
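One way to try this is a small host harness that can launch the kernel with or without the preceding host-to-device copies. The following is only a sketch under stated assumptions: the `--no-copy` switch, the `CUDA_CHECK` macro, and the file name used below are hypothetical, and the actual counts will depend on the GPU and its L2 size. Without the copies the inputs are uninitialized, so the results in C are meaningless; only the memory traffic is of interest.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",          \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Kernel from the question, unchanged in behavior.
__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    int blockStartIndex  = blockIdx.x * blockDim.x * N;
    int threadStartIndex = blockStartIndex + threadIdx.x;
    int threadEndIndex   = threadStartIndex + (N * blockDim.x);
    for (int i = threadStartIndex; i < threadEndIndex; i += blockDim.x) {
        C[i] = A[i] + B[i];
    }
}

int main(int argc, char** argv)
{
    const int grid = 180, block = 128, N = 1;
    // Hypothetical switch: pass --no-copy to skip the host-to-device copies.
    const bool copyInputs = !(argc > 1 && std::strcmp(argv[1], "--no-copy") == 0);

    const size_t count = static_cast<size_t>(grid) * block * N;
    const size_t bytes = count * sizeof(float);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    CUDA_CHECK(cudaMalloc(&dA, bytes));
    CUDA_CHECK(cudaMalloc(&dB, bytes));
    CUDA_CHECK(cudaMalloc(&dC, bytes));

    if (copyInputs) {
        // These copies may leave A and B resident in L2 before the kernel runs.
        std::vector<float> hA(count, 1.0f), hB(count, 2.0f);
        CUDA_CHECK(cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice));
        CUDA_CHECK(cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice));
    }

    AddVectors<<<grid, block>>>(dA, dB, dC, N);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(dA));
    CUDA_CHECK(cudaFree(dB));
    CUDA_CHECK(cudaFree(dC));
    return 0;
}
```

Assuming the file is saved as add_vectors.cu, it could be built with `nvcc -Xptxas -dlcm=cg add_vectors.cu -o add_vectors` and profiled twice, once as `nvprof --events fb_subp0_read_sectors,fb_subp1_read_sectors ./add_vectors` and once with `--no-copy` appended, to compare the two read counts.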

nvprof events "fb_subp0_read_sectors" and "fb_subp1_read_sectors" do not report correct results
