Question

I can't understand the CUDA code for the naive prefix sum.

This code is from https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html. In Example 39-1 (naive scan), we have the following code:

    __global__ void scan(float *g_odata, float *g_idata, int n)
    {
        extern __shared__ float temp[]; // allocated on invocation
        int thid = threadIdx.x;
        int pout = 0, pin = 1;
        // Load input into shared memory.
        // This is exclusive scan, so shift right by one
        // and set first element to 0
        temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
        __syncthreads();

        for (int offset = 1; offset < n; offset *= 2)
        {
            pout = 1 - pout; // swap double buffer indices
            pin = 1 - pout;
            if (thid >= offset)
                temp[pout*n+thid] += temp[pin*n+thid - offset];
            else
                temp[pout*n+thid] = temp[pin*n+thid];
            __syncthreads();
        }
        g_odata[thid] = temp[pout*n+thid]; // write output
    }

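My questions are:

1. Why do we need to create the shared memory array temp?
2. Why do we need the "pout" and "pin" variables? What do they do? Since we use only one block with at most 1024 threads here, can't we just use threadIdx.x to address the elements in the block?
3. In CUDA, do we use one thread to do one add operation? That is, does one thread do what a single iteration would do if I used a for loop (like looping over threads or processors in OpenMP, with one thread per array element)?
4. My previous two questions may seem naive... I think the key is that I don't understand the relationship between the implementation above and the following pseudocode: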

    for d = 1 to log2 n do
        for all k in parallel do
            if k >= 2^d then
                x[k] = x[k - 2^(d-1)] + x[k]

This is my first time using CUDA, so I'd appreciate it if anyone could answer my questions.

Answer 1

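1 - It is faster to put the data into shared memory (SM) and do the computation there than to work in global memory. The threads must be synchronized after the load into SM, which is why __syncthreads is called.

2 - These variables are probably there to make the swapping of the two buffers in the algorithm explicit. They simply toggle which half of temp is addressed: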

temp[pout*n+thid] += temp[pin*n+thid - offset];

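In the first iteration, pout = 1 and pin = 0. In the second iteration, pout = 0 and pin = 1. So on odd iterations the output index is offset by n, and on even iterations the input index is. Back to your question: you can't achieve the same thing with threadIdx.x alone, because it doesn't change inside the loop.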
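To make the toggling concrete, here is a minimal sketch of the same kernel with the two halves of temp treated strictly as "read" and "write" buffers. Note one change from the listing in the question: the quoted `+=` accumulates into the output half, which on the first pass has not been written yet, so this sketch instead reads both operands from the input half. The names scan_double_buffered and run_scan are mine, not from GPU Gems, and the host helper assumes a single block with n no larger than the block-size limit and skips error checking.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Exclusive naive scan for one block of n threads (n <= max threads per block).
// temp holds 2 * n floats: two buffers of n elements each.
__global__ void scan_double_buffered(float *g_odata, const float *g_idata, int n)
{
    extern __shared__ float temp[];
    int thid = threadIdx.x;
    int pout = 0, pin = 1;

    // Shift right by one so the result is an exclusive scan.
    temp[pout * n + thid] = (thid > 0) ? g_idata[thid - 1] : 0.0f;
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;  // this pass writes the other half
        pin  = 1 - pout;  // and reads the half written by the previous pass
        if (thid >= offset)
            temp[pout * n + thid] = temp[pin * n + thid] + temp[pin * n + thid - offset];
        else
            temp[pout * n + thid] = temp[pin * n + thid];
        __syncthreads();  // every write must land before the next pass reads it
    }
    g_odata[thid] = temp[pout * n + thid];
}

// Hypothetical host helper: copies the input over, launches one block, copies back.
std::vector<float> run_scan(const std::vector<float> &in)
{
    int n = static_cast<int>(in.size());
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    // Third launch parameter: dynamic shared memory for both buffers.
    scan_double_buffered<<<1, n, 2 * bytes>>>(d_out, d_in, n);
    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Without the second buffer, a thread could overwrite temp[thid] while another thread still needs the old value at temp[thid - offset] for the same pass, and that is the race pout and pin are there to avoid.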

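3 & 4 - CUDA launches threads to run the kernel, which means every thread runs that code independently. If you compare the pseudocode with the CUDA code, the "for all k in parallel" loop is what CUDA has parallelized: each thread handles one value of k, given by threadIdx.x. The loop over d (the offset loop in the kernel) is not parallelized. Every thread runs it from start to finish, waiting at __syncthreads for all the other threads to complete each pass, and only then writes its result to global memory.

Hope this helps.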
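One way to see the mapping is to write the pseudocode as plain sequential host code, with both loops spelled out. This is only a reference sketch: scan_reference is a name I made up, it computes the inclusive scan (it does not do the shift-right that the kernel does on load), and it uses the condition k >= 2^(d-1), which is what the kernel's `thid >= offset` test corresponds to.

```cuda
#include <vector>

// Hypothetical host-side reference: sequential version of the naive scan.
// Takes x by value and returns the inclusive prefix sums.
std::vector<float> scan_reference(std::vector<float> x)
{
    int n = static_cast<int>(x.size());
    std::vector<float> next(n);

    // Outer loop over d: offset = 2^(d-1). In the kernel every thread
    // runs this loop itself; it is not parallelized.
    for (int offset = 1; offset < n; offset *= 2) {
        // Inner loop: "for all k in parallel". In the kernel this loop
        // disappears, because k is threadIdx.x and each thread does one k.
        for (int k = 0; k < n; ++k)
            next[k] = (k >= offset) ? x[k] + x[k - offset] : x[k];

        // Plays the role of swapping pin and pout between passes.
        x.swap(next);
    }
    return x;
}
```

So a thread does not perform a single addition overall. It performs at most one addition per pass, for about log2(n) passes, always for the same element k.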
