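Instruction-level parallelism (ILP) and out-of-order execution on NVIDIA GPUs

Time: 2013-07-26 12:26:08

Tags: cuda nvidia

Do NVIDIA GPUs support out-of-order execution?

My first guess is that they do not contain such expensive hardware. However, while reading the CUDA programming guide, I noticed that it recommends using instruction-level parallelism (ILP) to improve performance.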


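2 answers:

Answer 0: (score: 6)

Pipelining is a common ILP technique, and it is certainly implemented on NVIDIA GPUs. I assume you agree that pipelining does not depend on out-of-order execution. In addition, NVIDIA GPUs of compute capability 2.0 and higher have multiple warp schedulers (2 or 4). If your code has 2 (or more) consecutive and independent instructions within a thread (or the compiler reorders it that way), the schedulers can exploit that ILP as well.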
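To make "independent instructions within a thread" concrete, here is a minimal sketch, not taken from the answer or from any NVIDIA material. The kernel names `oneChain` and `twoChains`, the helper `timeKernel`, and the launch sizes are all made up for illustration. The first kernel has one dependent chain of multiply-adds per thread; the second has two chains that never read each other's results, so an in-order pipeline can overlap them. The launch is deliberately small so that few warps are resident; with many warps per multiprocessor, thread-level parallelism may already hide the latency and the difference can vanish.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// ILP = 1: a single dependent chain. Each multiply-add needs the previous
// result, so the next one cannot be issued until that result is available.
__global__ void oneChain(float *out, float a, float b, int iters) {
    float x = (float)threadIdx.x;
    for (int i = 0; i < iters; i++) x = x * a + b;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

// ILP = 2: two chains that never read each other's result. Instructions stay
// in program order, yet the multiply-add on y can be issued while the one on
// x is still in flight.
__global__ void twoChains(float *out, float a, float b, int iters) {
    float x = (float)threadIdx.x;
    float y = x + 0.5f;
    for (int i = 0; i < iters; i++) {
        x = x * a + b;
        y = y * a + b;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x + y;
}

// Hypothetical helper: times one launch with CUDA events, returns milliseconds.
typedef void (*ChainKernel)(float *, float, float, int);
float timeKernel(ChainKernel k, float *d_out, int blocks, int threads, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    k<<<blocks, threads>>>(d_out, 0.999f, 0.5f, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int blocks = 64, threads = 64, iters = 1 << 20;  // illustrative sizes
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, blocks * threads * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    timeKernel(oneChain, d_out, blocks, threads, iters);  // warm-up launch
    float t1 = timeKernel(oneChain, d_out, blocks, threads, iters);
    float t2 = timeKernel(twoChains, d_out, blocks, threads, iters);
    // twoChains does twice the arithmetic; a ratio well below 2 suggests the
    // second chain overlapped with the first.
    printf("oneChain: %.3f ms, twoChains: %.3f ms, ratio: %.2f\n", t1, t2, t2 / t1);
    cudaFree(d_out);
    return 0;
}
```

The measured ratio depends on the GPU, the compiler version, and the occupancy, so treat the numbers as something to experiment with, not as a fixed result.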

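Here is a well-explained question about how the 2-wide warp schedulers and pipelining work together: How do nVIDIA CC 2.1 GPU warp schedulers issue 2 instructions at a time for a warp?

Also have a look at Vasily Volkov's talk at GTC 2010, in which he shows experimentally how ILP improves the performance of CUDA code: http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf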


Answer 1: (score: 1)



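In addition to the code, I report the timings obtained on an NVIDIA GT920M (Kepler architecture) for different values of N and ILP. It can be seen that:

  1. For large values of N, the memory bandwidth comes close to the maximum for the GT920M card, namely 14.4 GB/s;
  2. For any fixed N, changing the value of ILP does not change the performance.
  3. Regarding point 2, I also tested the same code on Maxwell and observed the same behavior (performance does not change with ILP). For a case in which performance does change with ILP, see the answer to The efficiency and performance of ILP for the NVIDIA Kepler architecture, which also reports tests on the Fermi architecture.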


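The performance is evaluated as the effective bandwidth in GB/s:

    (2.f * 4.f * N * numITER) / (1e9 * timeTotal * 1e-3)

where

    4.f * N * numITER

is the number of bytes read from (or written to) global memory,

    2.f * 4.f * N * numITER

is the number of bytes read and written, and

    timeTotal * 1e-3

is the total time in seconds (timeTotal is accumulated in ms).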
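As a worked example of the bandwidth arithmetic, the following host-only sketch wraps the same expression in a function. The name `effectiveBandwidthGBs` is hypothetical, and the 1000 ms passed in `main` is an illustrative input, not a measurement from the tests above.

```cuda
#include <cstdio>

// Effective bandwidth in GB/s for a copy kernel that reads and writes
// numElems 4-byte ints per launch, repeated numIter times, with a total
// elapsed time of totalMs milliseconds. Hypothetical helper for illustration.
double effectiveBandwidthGBs(long long numElems, int numIter, double totalMs) {
    const double bytesMoved = 2.0 * sizeof(int) * (double)numElems * numIter;  // reads + writes
    const double seconds = totalMs * 1e-3;                                      // ms -> s
    return bytesMoved / (1e9 * seconds);
}

int main() {
    // 16777216 ints, 100 launches, assumed total of 1000 ms (illustrative only):
    // 2 * 4 * 16777216 * 100 = 13421772800 bytes moved in 1 s.
    printf("%.2f GB/s\n", effectiveBandwidthGBs(16777216LL, 100, 1000.0));  // prints "13.42 GB/s"
    return 0;
}
```

Dividing by `1e9 * timeTotal * 1e-3` is the same as dividing by `1e6 * timeTotal`, which is the form used in the `printf` of the test program.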



    // --- GT920M - 14.4 GB/s
    //     http://gpuboss.com/gpus/GeForce-GTX-280M-vs-GeForce-920M
    #include <cstdio>
    #include <cstdlib>
    // Helper headers (not shown here); they are expected to provide
    // gpuErrchk, iDivUp and the TimingGPU timer class.
    #include "Utilities.cuh"
    #include "TimingGPU.cuh"

    #define BLOCKSIZE    32
    #define DEBUG

    /* KERNEL */
    // Each thread copies ILP elements spaced blockDim.x apart, so a block
    // covers BLOCKSIZE * ILP consecutive elements. Only the first element is
    // bounds-checked: N is assumed to be a multiple of BLOCKSIZE * ILP.
    __global__ void ILPKernel(const int * __restrict__ d_a, int * __restrict__ d_b, const int ILP, const int N) {
        const int tid = threadIdx.x + blockIdx.x * blockDim.x * ILP;
        if (tid >= N) return;
        for (int j = 0; j < ILP; j++) d_b[tid + j * blockDim.x] = d_a[tid + j * blockDim.x];
    }

    /* MAIN */
    int main() {
        //const int N = 8192;
        const int N = 524288 * 32;
        //const int N = 1048576;
        //const int N = 262144;
        //const int N = 2048;
        const int numITER = 100;
        const int ILP = 16;
        TimingGPU timerGPU;

        // Host buffers: h_a is the source, h_b starts with different values
        int *h_a = (int *)malloc(N * sizeof(int));
        int *h_b = (int *)malloc(N * sizeof(int));
        for (int i = 0; i < N; i++) {
            h_a[i] = 2;
            h_b[i] = 1;
        }

        int *d_a; gpuErrchk(cudaMalloc(&d_a, N * sizeof(int)));
        int *d_b; gpuErrchk(cudaMalloc(&d_b, N * sizeof(int)));
        gpuErrchk(cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice));
        gpuErrchk(cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice));

        /* ILP KERNEL */
        float timeTotal = 0.f;   // accumulated kernel time in ms
        for (int k = 0; k < numITER; k++) {
            // StartCounter() is assumed to be the TimingGPU call paired with GetCounter()
            timerGPU.StartCounter();
            ILPKernel<<<iDivUp(N / ILP, BLOCKSIZE), BLOCKSIZE>>>(d_a, d_b, ILP, N);
    #ifdef DEBUG
            gpuErrchk(cudaPeekAtLastError());
            gpuErrchk(cudaDeviceSynchronize());
    #endif
            timeTotal = timeTotal + timerGPU.GetCounter();
        }
        printf("Bandwidth = %f GB / s; Num blocks = %d\n", (2.f * 4.f * N * numITER) / (1e6 * timeTotal), iDivUp(N / ILP, BLOCKSIZE));

        // Verify that the copy is correct
        gpuErrchk(cudaMemcpy(h_b, d_b, N * sizeof(int), cudaMemcpyDeviceToHost));
        for (int i = 0; i < N; i++) if (h_a[i] != h_b[i]) { printf("Error at i = %i for kernel0! Host = %i; Device = %i\n", i, h_a[i], h_b[i]); return 1; }
        return 0;
    }


    GT 920M
    N = 512  - ILP = 1  - BLOCKSIZE = 512 (1 block  - each block processes 512 elements)  - Bandwidth = 0.092 GB / s
    N = 1024 - ILP = 1  - BLOCKSIZE = 512 (2 blocks - each block processes 512 elements)  - Bandwidth = 0.15  GB / s
    N = 2048 - ILP = 1  - BLOCKSIZE = 512 (4 blocks - each block processes 512 elements)  - Bandwidth = 0.37  GB / s
    N = 2048 - ILP = 2  - BLOCKSIZE = 256 (4 blocks - each block processes 512 elements)  - Bandwidth = 0.36  GB / s
    N = 2048 - ILP = 4  - BLOCKSIZE = 128 (4 blocks - each block processes 512 elements)  - Bandwidth = 0.35  GB / s
    N = 2048 - ILP = 8  - BLOCKSIZE =  64 (4 blocks - each block processes 512 elements)  - Bandwidth = 0.26  GB / s
    N = 2048 - ILP = 16 - BLOCKSIZE =  32 (4 blocks - each block processes 512 elements)  - Bandwidth = 0.31  GB / s
    N = 4096 - ILP = 1  - BLOCKSIZE = 512 (8 blocks - each block processes 512 elements)  - Bandwidth = 0.53  GB / s
    N = 4096 - ILP = 2  - BLOCKSIZE = 256 (8 blocks - each block processes 512 elements)  - Bandwidth = 0.61  GB / s
    N = 4096 - ILP = 4  - BLOCKSIZE = 128 (8 blocks - each block processes 512 elements)  - Bandwidth = 0.74  GB / s
    N = 4096 - ILP = 8  - BLOCKSIZE =  64 (8 blocks - each block processes 512 elements)  - Bandwidth = 0.74  GB / s
    N = 4096 - ILP = 16 - BLOCKSIZE =  32 (8 blocks - each block processes 512 elements)  - Bandwidth = 0.56  GB / s
    N = 8192 - ILP = 1  - BLOCKSIZE = 512 (16 blocks - each block processes 512 elements) - Bandwidth = 1.4  GB / s
    N = 8192 - ILP = 2  - BLOCKSIZE = 256 (16 blocks - each block processes 512 elements) - Bandwidth = 1.1  GB / s
    N = 8192 - ILP = 4  - BLOCKSIZE = 128 (16 blocks - each block processes 512 elements) - Bandwidth = 1.5  GB / s
    N = 8192 - ILP = 8  - BLOCKSIZE =  64 (16 blocks - each block processes 512 elements) - Bandwidth = 1.4  GB / s
    N = 8192 - ILP = 16 - BLOCKSIZE =  32 (16 blocks - each block processes 512 elements) - Bandwidth = 1.3  GB / s
    N = 16777216 - ILP = 1  - BLOCKSIZE = 512 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.9  GB / s
    N = 16777216 - ILP = 2  - BLOCKSIZE = 256 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.8  GB / s
    N = 16777216 - ILP = 4  - BLOCKSIZE = 128 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.8  GB / s
    N = 16777216 - ILP = 8  - BLOCKSIZE =  64 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.7  GB / s
    N = 16777216 - ILP = 16 - BLOCKSIZE =  32 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.6  GB / s