Question

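My understanding is that in CUDA, increasing the number of blocks should not increase the run time, because the blocks execute in parallel. But in my code, if I double the number of blocks, the time doubles as well.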

#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>

#define num_of_blocks 500
#define num_of_threads 512

__constant__ double y = 1.1;

// set seed for random number generator
__global__ void initcuRand(curandState* globalState, unsigned long seed){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, idx, 0, &globalState[idx]);
}

// kernel function for SIR
__global__ void test(curandState* globalState, double *dev_data){
    // global threads id
    int idx     = threadIdx.x + blockIdx.x * blockDim.x;

    // local threads id
    int lidx    = threadIdx.x;

    // create shared memory to store the RNG states
    __shared__ curandState localState[num_of_threads];

    // shared memory to store samples
    __shared__ double sample[num_of_threads];

    // copy global seed to local
    localState[lidx]    = globalState[idx];
    __syncthreads();

    sample[lidx]    =  y + curand_normal_double(&localState[lidx]);

    if(lidx == 0){
        // save the first sample to dev_data;
        dev_data[blockIdx.x] = sample[0];
    }

    globalState[idx]    = localState[lidx];
}

int main(){
    // create the random number generator states
    curandState *globalState;
    cudaMalloc((void**)&globalState, num_of_blocks*num_of_threads*sizeof(curandState));
    initcuRand<<<num_of_blocks, num_of_threads>>>(globalState, 1);

    double *dev_data;
    cudaMalloc((double**)&dev_data, num_of_blocks*sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    // Start record
    cudaEventRecord(start, 0);

    test<<<num_of_blocks, num_of_threads>>>(globalState, dev_data);

    // Stop event
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop); // that's our time!
    // Clean up:
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    std::cout << "Time elapsed: " << elapsedTime << " ms" << std::endl;

    cudaFree(dev_data);
    cudaFree(globalState);
    return 0;
}

The test results are:

number of blocks: 500, Time elapsed: 0.39136 ms.
number of blocks: 1000, Time elapsed: 0.618656 ms.

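So what is the reason the time increases? Is it because I access constant memory, or because I copy data from shared memory to global memory? Are there any ways to optimize it?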

Answer 1

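While the number of blocks that can run in parallel may be large, it is still finite because on-chip resources are limited. If a kernel launch requests more blocks than that limit, the additional blocks have to wait for earlier blocks to finish and release their resources.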

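One limited resource is shared memory, of which your kernel uses 28 kilobytes. Nvidia GPUs compatible with CUDA 8.0 provide between 48 and 112 kilobytes of shared memory per streaming multiprocessor (SM), so the maximum number of blocks running at any one time is between 1× and 3× the number of SMs on your GPU.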
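
The 28-kilobyte figure and the resulting per-SM limits can be reproduced with a little arithmetic. The sketch below is host-only and rests on one assumption: that `sizeof(curandState)` is 48 bytes, which is what the 28 KB total implies once the 512 doubles are subtracted. The function names are made up for illustration.

```cpp
#include <cstddef>

// Assumption: the default curandState occupies 48 bytes.
constexpr std::size_t kCurandStateBytes = 48;
// Same block size as num_of_threads in the question.
constexpr std::size_t kThreadsPerBlock = 512;

// Static shared memory per block: one RNG state plus one double per thread.
constexpr std::size_t sharedBytesPerBlock() {
    return kThreadsPerBlock * (kCurandStateBytes + sizeof(double));
}

// Upper bound on resident blocks per SM, considering shared memory only.
constexpr std::size_t blocksPerSmBySharedMem(std::size_t sharedBytesPerSm) {
    return sharedBytesPerSm / sharedBytesPerBlock();
}

// Compile-time checks of the numbers quoted in the answer.
static_assert(sharedBytesPerBlock() == 28 * 1024, "28 KB per block");
static_assert(blocksPerSmBySharedMem(48 * 1024) == 1, "48 KB SM holds 1 block");
static_assert(blocksPerSmBySharedMem(96 * 1024) == 3, "96 KB SM holds 3 blocks");
```

This only accounts for shared memory; registers and warp slots can lower the limit further.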

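Other limited resources are registers and various per-warp resources in the scheduler. The CUDA occupancy calculator is a handy Excel spreadsheet (it also works with OpenOffice/LibreOffice) that shows you how these resources limit the number of blocks per SM for a specific kernel. Compile the kernel with the option --ptxas-options="-v" added to the nvcc command line, find the line that says "ptxas info : Used XX registers, YY bytes smem, zz bytes cmem[0], ww bytes cmem[2]", and enter XX, YY, the number of threads per block you are trying to launch, and the compute capability of your GPU into the spreadsheet. It will then show the maximum number of blocks that can run in parallel on one SM.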
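
As an alternative to the spreadsheet, the CUDA runtime can report the same per-SM limit through `cudaOccupancyMaxActiveBlocksPerMultiprocessor`. The following is a minimal sketch, not the answer's method: the kernel is a stand-in with the same static shared-memory footprint as the one in the question, device 0 is assumed, and the printed numbers depend on your GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

constexpr int kThreadsPerBlock = 512;  // same block size as in the question

// Stand-in for the question's kernel, with the same shared-memory arrays.
__global__ void test(curandState* globalState, double* dev_data) {
    __shared__ curandState localState[kThreadsPerBlock];
    __shared__ double sample[kThreadsPerBlock];
    int idx  = threadIdx.x + blockIdx.x * blockDim.x;
    int lidx = threadIdx.x;
    localState[lidx] = globalState[idx];
    sample[lidx] = 1.1 + curand_normal_double(&localState[lidx]);
    if (lidx == 0) dev_data[blockIdx.x] = sample[0];
    globalState[idx] = localState[lidx];
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  // assumes device 0
        std::fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }

    // Ask the runtime how many blocks of this kernel fit on one SM.
    int blocksPerSm = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, test, kThreadsPerBlock, 0 /* no dynamic shared memory */);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("%s: %d SMs, %d blocks per SM, %d blocks resident at once\n",
                prop.name, prop.multiProcessorCount, blocksPerSm,
                blocksPerSm * prop.multiProcessorCount);
    return 0;
}
```

The query takes registers, shared memory, and warp limits into account together, so it should agree with what the spreadsheet shows for the same kernel and block size.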

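You don't mention which GPU you have been running the test on, so I'll use a GTX 980 as an example. It has 16 SMs with 96 KB of shared memory each, so at most 16×3 = 48 blocks can run in parallel. If you were not using shared memory, the maximum number of resident warps would limit the number of blocks per SM to 4, allowing 64 blocks to run in parallel.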

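On any presently existing Nvidia GPU, your example needs at least about a dozen waves of blocks executing one after another, which explains why doubling the number of blocks also roughly doubles the run time.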
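
To make the "waves" argument concrete, here is a small host-only sketch that uses the GTX 980 figures from the example above (48 resident blocks with shared memory, 64 without). The helper name `wavesNeeded` is hypothetical, and the model ignores launch overhead and partially filled waves.

```cpp
// Number of sequential waves when only residentBlocks can run at once.
constexpr int wavesNeeded(int totalBlocks, int residentBlocks) {
    return (totalBlocks + residentBlocks - 1) / residentBlocks;  // ceiling division
}

// With 28 KB of shared memory per block: 16 SMs x 3 blocks = 48 resident blocks.
static_assert(wavesNeeded(500, 48) == 11, "500 blocks need 11 waves");
static_assert(wavesNeeded(1000, 48) == 21, "1000 blocks need 21 waves");

// Without the shared-memory limit: 16 SMs x 4 blocks = 64 resident blocks.
static_assert(wavesNeeded(500, 64) == 8, "500 blocks need 8 waves");
static_assert(wavesNeeded(1000, 64) == 16, "1000 blocks need 16 waves");
```

Going from 500 to 1000 blocks takes the wave count from 11 to 21 in this model, so the run time grows with the block count instead of staying flat.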

Why does increasing the number of blocks in CUDA increase the time?
