CUDA L2 transfer overhead

Date: 2015-12-24 16:39:33

Tags: cuda disassembly nsight ldg

I have a kernel that uses atomicMin to test-render points. The test setup has a lot of points in a best-case memory layout: two buffers, one uint32 per cluster for the cluster position and 256x uint32 per cluster for the packed points.

namespace Point
{
struct PackedBitfield
{
    glm::uint32_t x : 6;
    glm::uint32_t y : 6;
    glm::uint32_t z : 6;
    glm::uint32_t nx : 4;
    glm::uint32_t ny : 4;
    glm::uint32_t nz : 4;
    glm::uint32_t unused : 2;
};

union __align__(4) Packed
{
    glm::uint32_t bits;
    PackedBitfield field;
};

struct ClusterPositionBitfield
{
    glm::uint32_t x : 10;
    glm::uint32_t y : 10;
    glm::uint32_t z : 10;
    glm::uint32_t w : 2;
};

union ClusterPosition
{
    glm::uint32_t bits;
    ClusterPositionBitfield field;
};
}

//
// launch with blockSize=(256, 1, 1) and grid=(numberOfClusters, 1, 1)
//
extern "C" __global__ void pointsRenderKernel(mat4 u_mvp,
                    ivec2 u_resolution,
                    uint64_t* rasterBuffer,
                    Point::Packed* points, 
                    Point::ClusterPosition* clusterPosition)
{
    // extract and compute world position
    const Point::ClusterPosition cPosition(clusterPosition[blockIdx.x]);
    const Point::Packed point(points[blockIdx.x * 256 + threadIdx.x]);

    // ...use points and write to buffer...
}
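
The elided part boils down to projecting the point and doing an atomicMin depth test on the raster buffer, roughly like the sketch below, where the world-position reconstruction, depth quantization, and bit packing are simplified placeholders rather than the exact code:

    // sketch of the elided body; the reconstruction and packing details
    // here are placeholders
    vec3 worldPosition = vec3(cPosition.field.x, cPosition.field.y, cPosition.field.z)
                       + vec3(point.field.x, point.field.y, point.field.z) / 64.0f;
    vec4 clip = u_mvp * vec4(worldPosition, 1.0f);
    vec3 ndc = vec3(clip) / clip.w;
    ivec2 pixel = ivec2((vec2(ndc) * 0.5f + 0.5f) * vec2(u_resolution));
    if (pixel.x >= 0 && pixel.x < u_resolution.x &&
        pixel.y >= 0 && pixel.y < u_resolution.y)
    {
        // quantized depth goes in the high bits so atomicMin keeps the
        // nearest point per pixel; the packed point goes in the low bits
        uint64_t depth = (uint64_t)((ndc.z * 0.5f + 0.5f) * 16777215.0f); // 24-bit depth
        atomicMin((unsigned long long*)&rasterBuffer[pixel.y * u_resolution.x + pixel.x],
                  (unsigned long long)((depth << 32) | point.bits));
    }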

The generated SASS looks like this:

[screenshot: generated SASS disassembly]

Looking at the memory profiler output, the L2 transfer overhead for reads from the Point::Packed* buffer is 3.0. Why is that? The memory should be perfectly aligned and sequential. Also, why does this automatically generate LDG instructions (compute_50, sm_50)? I don't need that cache.
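
As far as I understand, nvcc emits LDG when it can prove a global pointer is only ever read; the usual way to make that explicit is qualifying the parameters, as in this sketch (for illustration, not my actual signature):

// const + __restrict__ marks the buffers read-only and non-aliasing,
// which is what normally steers nvcc toward LDG loads:
extern "C" __global__ void pointsRenderKernel(mat4 u_mvp,
                    ivec2 u_resolution,
                    uint64_t* rasterBuffer,
                    const Point::Packed* __restrict__ points,
                    const Point::ClusterPosition* __restrict__ clusterPosition);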

1 Answer:

Answer 0 (score: 0)

The tooltip for L2 transfer overhead says it measures the number of bytes actually transferred between L1 and L2 for each byte requested by L1, and it also says "lower is better".
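
To make that concrete (my own arithmetic, the tooltip does not spell this out): a fully coalesced warp read of 32 consecutive 4-byte values requests 128 bytes, and only 128 bytes have to move between the caches, a ratio of 128 / 128 = 1.0. Scattered reads instead pull in whole lines or sectors that are mostly unused, so more bytes move than were requested and the ratio climbs above 1.0.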

In my case, the L2 transfer overhead for reading Point::Packed is 1.0:

[screenshot: memory analyzer output]

Test code:

#include <cstdint>
#include <iostream>

namespace Point
{
    struct PackedBitfield
    {
        uint32_t x : 6;
        uint32_t y : 6;
        uint32_t z : 6;
        uint32_t nx : 4;
        uint32_t ny : 4;
        uint32_t nz : 4;
        uint32_t unused : 2;
    };

    union __align__(4) Packed
    {
        uint32_t bits;
        PackedBitfield field;
    };

    struct ClusterPositionBitfield
    {
        uint32_t x : 10;
        uint32_t y : 10;
        uint32_t z : 10;
        uint32_t w : 2;
    };

    union ClusterPosition
    {
        uint32_t bits;
        ClusterPositionBitfield field;
    };
}

__global__ void pointsRenderKernel(Point::Packed* points, Point::ClusterPosition* clusterPosition)
{
    int t_id = blockIdx.x * blockDim.x + threadIdx.x;

    // copy each element into the second half of its buffer so both the
    // reads and the writes are fully coalesced; every thread in a block
    // writes the same cluster element, which is redundant but harmless
    clusterPosition[blockIdx.x + blockDim.x] = clusterPosition[blockIdx.x];
    points[t_id + blockDim.x * gridDim.x] = points[t_id];
}

int main()
{
    int blockSize = 256;
    int numberOfClusters = 256;

    std::cout << sizeof(Point::Packed) << std::endl;          // prints 4
    std::cout << sizeof(Point::ClusterPosition) << std::endl; // prints 4

    // each buffer is allocated at twice its logical size so the kernel
    // can copy the first half into the second half
    Point::Packed *d_points;
    cudaMalloc(&d_points, sizeof(Point::Packed) * numberOfClusters * blockSize * 2);

    Point::ClusterPosition *d_clusterPositions;
    cudaMalloc(&d_clusterPositions, sizeof(Point::ClusterPosition) * numberOfClusters * 2);

    pointsRenderKernel<<<numberOfClusters, blockSize>>>(d_points, d_clusterPositions);

    return 0;
}
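
One addition of my own that is worth making before profiling (not part of the original answer): synchronize and check the launch result at the end of main, so a failed launch is not mistaken for a clean run:

    // make sure the kernel actually ran before trusting profiler output
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        std::cout << cudaGetErrorString(err) << std::endl;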

Update

While I was on the latest driver I had some other problems in Nsight. I downgraded the driver to the version that ships with the default CUDA 8.0.61 installer (downloaded from here), and that fixed it. The version bundled with the installer is 376.51. Tested on Windows 10 64-bit with Visual Studio 2015 and Nsight 5.2; my card is cc 6.1.

Here is my full compiler command:

  

nvcc.exe -gencode=arch=compute_61,code=\"sm_61,compute_61\" --use-local-env --cl-version 2015 -Xcompiler "/wd4819" -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" -lineinfo --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /FS /Zi /MD" -o x64\Release\kernel.cu.obj kernel.cu

Update 2

I get the same result when compiling with the sm_50,compute_50 options: 1.0 for the L2 transfer overhead.
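
To double-check which SASS each of these builds actually contains, it can be dumped from the object file with the standard cuobjdump tool (my suggestion, not something from the original answer):

cuobjdump -sass x64\Release\kernel.cu.obj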