Question

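I'm opening this thread because I noticed strange behavior in my code's output while trying to get a deeper understanding of some basic CUDA concepts, such as speed versus the number of blocks/threads. Any help would be greatly appreciated!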

First, here are some specs of my graphics card:
Name: GeForce 8600M GT
Multiprocessor count: 4
Max threads per block: 512
Max grid size: (65535, 65535, 1)

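I'm using the following simple code. It fills three arrays of length N with 1s, adds them element by element into a fourth array, and sums the result. The sum is obviously predictable and equals 3N.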

#include <cstdio>
#include <iostream>
#include "ArgumentParser.h"

//using namespace std;

__global__ void addVector(int *a, int *b, int *c, int *d, int *N){
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid<*N) {
        d[tid] = a[tid] + b[tid] + c[tid];
    }
}

int main(int argc, char *argv[]) {
    //Handy way to pass command-line arguments.
    ArgumentParser parser(argc, argv);
    int nblocks = parser("-nblocks").asInt(1);
    int nthreads = parser("-nthreads").asInt(1);

    //Defining arrays on host.
    int N = 100000;
    int a[N];
    int b[N];
    int c[N];
    int d[N];

    //Pointers to the arrays that will go to the device.
    int *dev_a;
    int *dev_b;
    int *dev_c;
    int *dev_d;
    int *dev_N;

    //Filling up a, b, and c.
    for (int i=0; i<N; i++){
        a[i] = 1;
        b[i] = 1;
        c[i] = 1;
    } 

    //Allocating device memory: each dev_x is set to point to a buffer on the
    //device of the proper size.
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));
    cudaMalloc((void**)&dev_d, N * sizeof(int));
    cudaMalloc((void**)&dev_N, sizeof(int));

    //Copying the contents of a/b/c and N from the host to the device.
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_N, &N, sizeof(int), cudaMemcpyHostToDevice);

    //Initializing the cuda timers.
    cudaEvent_t start, stop;
    cudaEventCreate(&start); 
    cudaEventCreate(&stop);
    cudaEventRecord (start, 0);

    //Executing the kernel.
    addVector<<<nblocks, nthreads>>>(dev_a, dev_b, dev_c, dev_d, dev_N);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float time;
    cudaEventElapsedTime(&time, start, stop);
    printf ("CUDA time: %3.5f s\n", time/1000);

    //Copying the result from device to host.
    cudaMemcpy(d, dev_d, N * sizeof(int), cudaMemcpyDeviceToHost);

    //Freeing the memory allocated on the GPU
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    cudaFree(dev_d);
    cudaFree(dev_N);

    //Checking the predictable result.
    int sum=0;
    for (int i=0; i<N; i++){
        sum += d[i];
    }
    printf("Result of the sum: %d. It should be: %d.\n", sum, 3*N);
}

Question 1:
When I compile the code and run:

./addArrayCuda -nblocks 1 -nthreads 1

I get the answer:

Result of the sum: -642264408. It should be: 300000.

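That seems reasonable. I'm using a single block with a single thread, so only the first element of each array is added. The remaining elements of the result hold arbitrary values, so their sum is unpredictable. We need nblocks * nthreads >= N, so let's try: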

./addArrayCuda -nblocks 3125 -nthreads 32

The output is:

Result of the sum: 300000. It should be: 300000.

That makes sense: 3125 * 32 = 100000 = N. So far, so good. But if I then rerun the previous command (nblocks = nthreads = 1) without recompiling, I get:

./addArrayCuda -nblocks 1 -nthreads 1
Result of the sum: 300000. It should be: 300000.

What is going on?

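Question 2: This question is about the relationship between nblocks/nthreads and execution speed. I know it may not make much sense if a problem in the code explains Question 1, but I'll ask it anyway. I measured the execution time of the code (averaged over 5 runs) for different numbers of blocks/threads, while making sure that nblocks * nthreads >= N. Here is what I got (I have a nice plot, but not enough reputation to post it...):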

(nblocks, nthreads)   Execution time [s]   Rate of increase (vs. previous row)
(196, 512)            5.0e-4               -
(391, 256)            4.8e-4               1.0
(782, 128)            4.8e-4               1.0
(1563, 64)            4.9e-4               1.0
(3125, 32)            5.0e-4               1.0
(6250, 16)            5.2e-4               1.0
(12500, 8)            9.0e-4               1.7
(25000, 4)            1.3e-3               1.4
(50000, 2)            2.3e-3               1.8

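My interpretation: the GPU is divided into blocks, and each block is divided into threads. At every clock cycle the GPU dispatches the kernel to 4 blocks (the multiprocessor count) and, within each block, to a warp (a group of 32 threads). This means that using a number of threads that is not a multiple of 32 wastes resources. This lets us understand the general relationship between (nblocks, nthreads) and execution time. From (196, 512) to (3125, 32), the GPU takes roughly the same number of clock cycles, approximately proportional to (nblocks / 4) * (nthreads / 32). However, we would roughly expect the execution time to double between (3125, 32) and (6250, 16), between (6250, 16) and (12500, 8), and so on.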

Why is this not the case? More specifically, why is there no significant difference in execution time between (3125, 32) and (6250, 16)?

Thank you for taking the time to read this far ;-)

Answer 1

A1

With blocks=threads=1, you only compute d[0] and leave d[1...99999] unchanged. Since d is never initialized, the sum then includes whatever values happen to be in d[1...99999].

You can initialize d[0...99999] with all zeros to get a constant result.

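In the third experiment, getting sum == 300000 with -nblocks 1 -nthreads 1 is probably a coincidence: the program allocated d[] (the device buffer dev_d, which is copied back to the host in full) in exactly the same memory as in the previous run, and the values in that memory were unchanged. So you get the same result as in the second experiment, not a result that was actually computed correctly.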

A2

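There are two factors you may need to consider when estimating the time cost.

Your kernel has almost no arithmetic operations, which makes it a bandwidth-bound kernel. When memory accesses are coalesced, the number of compute threads may not be the performance bottleneck.
Your data size is small. The kernel launch overhead may be too large to ignore.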

Understanding basic CUDA concepts through vector addition