Question

I have been learning CUDA for a while, and I have the following problem.

Here is what I am doing:

Copy to the GPU:

int* B;
// ...
int *dev_B;    
//initialize B=0

cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int),cudaMemcpyHostToDevice);
//...

//Execute on GPU the following function which is supposed to fill in 
//the dev_B matrix with integers


findNeiborElem <<< Nblocks, Nthreads >>>(dev_B, dev_MSH, dev_Nel, dev_Npel, dev_Nface, dev_FC);

Copy back to the CPU:

cudaMemcpy(B, dev_B, Nel*Nface*sizeof(int),cudaMemcpyDeviceToHost);

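Copying the array B to dev_B takes only a fraction of a second. However, copying dev_B back to B takes forever.

The findNeiborElem function involves a loop in each thread. For example, it looks like this: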

__global__ void findNeiborElem(int *dev_B, int *dev_MSH, int *dev_Nel, int *dev_Npel, int *dev_Nface, int *dev_FC){

    int tid=threadIdx.x + blockIdx.x * blockDim.x;
    while (tid<dev_Nel[0]){
        for (int j=1;j<=dev_Nel[0];j++){
             // do some calculations
             dev_B[ind(tid,1,dev_Nel[0])]=j; // in most cases j does not go all the way up to Nel
             break; 
        }
        tid += blockDim.x * gridDim.x; 
    }
}

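What is very strange is that the time to copy dev_B back to B is proportional to the number of iterations of the j index.

For example, if Nel=5, the time is about 5 sec.

When I increase it to Nel=20, the time is about 20 sec.

I would expect the copy time to be independent of the inner iterations needed to assign the values of the matrix dev_B.

I would also expect copying the same matrix to and from the CPU to take about the same time.

Do you have any idea what is wrong?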

Answer 1

You should not use clock() to measure the time; use events instead.

With events, you would have something like this:

  cudaEvent_t start, stop;   // the two events that bracket the timed region
  float time;                // elapsed time, in milliseconds
  cudaEventCreate(&start);   // create the start event
  cudaEventCreate(&stop);    // create the stop event
  cudaEventRecord(start, 0); // start measuring the time

  // What you want to measure
  cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
  cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int),cudaMemcpyHostToDevice);

  cudaEventRecord(stop, 0);                  // stop measuring the time
  cudaEventSynchronize(stop);               // wait until all device work preceding
                                            // the stop event has completed

  cudaEventElapsedTime(&time, start, stop); // store the elapsed time (ms) in time
  cudaEventDestroy(start);                  // release the events
  cudaEventDestroy(stop);

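Edit: additional information:

"The kernel launch returns control to the CPU thread before the kernel has finished. Therefore your timing construct is measuring both the kernel execution time and the second memcpy. When you time the copy after the kernel, your timer code is executed immediately, but cudaMemcpy waits for the kernel to complete before it starts. This explains why your timing measurement for the copy back seems to vary with the number of kernel loop iterations. It also explains why the time spent in your kernel function appears to be "negligible"." Credit to Robert Crovella.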
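To make the quoted explanation concrete, here is a minimal sketch that times the kernel and the copy back as two separate intervals by recording a third event between them. The names busyKernel and timeKernelAndCopy are hypothetical, the kernel body is placeholder work standing in for findNeiborElem, and the launch configuration of 64 blocks of 128 threads is an assumption, not something taken from the question.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical stand-in for findNeiborElem: each thread runs `iters`
// iterations of placeholder work and stores the result.
__global__ void busyKernel(int *out, int n, int iters) {
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x; tid < n;
         tid += blockDim.x * gridDim.x) {
        int acc = 0;
        for (int j = 1; j <= iters; ++j) acc += j;
        out[tid] = acc;
    }
}

// Times the kernel and the device-to-host copy separately (milliseconds).
// Returns 0 on success, 1 on any CUDA error.
int timeKernelAndCopy(int n, int iters, int *host_out,
                      float *kernel_ms, float *copy_ms) {
    assert(n > 0 && host_out && kernel_ms && copy_ms);
    int *dev_out = nullptr;
    if (cudaMalloc(&dev_out, n * sizeof(int)) != cudaSuccess) return 1;

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    cudaEventRecord(t0, 0);
    busyKernel<<<64, 128>>>(dev_out, n, iters);  // returns to the host at once
    cudaEventRecord(t1, 0);  // completes on the device only after the kernel
    cudaMemcpy(host_out, dev_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(t2, 0);
    cudaError_t err = cudaEventSynchronize(t2);  // wait for all work above

    if (err == cudaSuccess) err = cudaEventElapsedTime(kernel_ms, t0, t1);
    if (err == cudaSuccess) err = cudaEventElapsedTime(copy_ms, t1, t2);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaFree(dev_out);
    return err == cudaSuccess ? 0 : 1;
}
```

With the intervals split this way, the kernel's run time lands in kernel_ms, and copy_ms should depend only on the buffer size. If you prefer a host-side timer, calling cudaDeviceSynchronize() after the launch and before starting the timer keeps the kernel time out of the copy measurement in the same way.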

Answer 2

As for your second question:

 dev_B[ind(tid,1,dev_Nel[0])]=j; // in most cases j does not go all the way up to Nel

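When a computation runs on the GPU, a thread that has finished its job does no further work, for synchronization reasons, until all threads in the same work group (a thread block, in CUDA terms) have finished.

In other words, the time this calculation takes is the worst-case time; it does not matter if most threads stop early.

I am not sure about your first question. How do you measure the time? I am not very familiar with CUDA, but I think that when copying from the CPU to the GPU the implementation buffers your data, hiding the actual time involved.

Copying from the GPU to the CPU is slower than copying from the CPU to the GPU.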
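Whether one direction is really slower depends on the system, so it is safer to measure than to assume. The sketch below times one copy in each direction for the same buffer, using events so that any host-side staging of pageable memory does not hide the transfer time. The function name timeBothCopies is hypothetical, and the buffer is ordinary pageable memory from calloc, which is an assumption about how B was allocated.

```cuda
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

// Times one host-to-device and one device-to-host copy of `n` ints.
// Results are in milliseconds. Returns 0 on success, 1 on any error.
int timeBothCopies(size_t n, float *h2d_ms, float *d2h_ms) {
    assert(n > 0 && h2d_ms && d2h_ms);
    size_t bytes = n * sizeof(int);

    int *host = static_cast<int *>(calloc(n, sizeof(int)));  // pageable memory
    if (!host) return 1;
    int *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) { free(host); return 1; }

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    cudaEventRecord(t0, 0);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(t2, 0);
    cudaError_t err = cudaEventSynchronize(t2);  // wait for both copies

    if (err == cudaSuccess) err = cudaEventElapsedTime(h2d_ms, t0, t1);
    if (err == cudaSuccess) err = cudaEventElapsedTime(d2h_ms, t1, t2);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaFree(dev);
    free(host);
    return err == cudaSuccess ? 0 : 1;
}
```

Calling it with n set to Nel*Nface gives the two copy times for a buffer the size of B. Because no kernel runs in between, any large difference here would come from the transfers themselves rather than from waiting on device work.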
