Question

The following global barrier works on a Kepler K10 but not on a Fermi GTX580:

__global__ void cudaKernel (float* ref1, float* ref2, int* lock, int time, int dim) {
  int gid  = blockIdx.x * blockDim.x + threadIdx.x;
  int lid  = threadIdx.x;                          
  int numT = blockDim.x * gridDim.x;               
  int numP = int (dim / numT);                     
  int numB = gridDim.x;

  for (int t = 0; t < time; ++t) {
    // compute @ time t
    for (int i = 0; i < numP; ++i) {
      int idx  = gid + i * numT;
      if (idx > 0 && idx < dim - 1)
        ref2 [idx]  = 0.333f * ((ref1 [idx - 1] + ref1 [idx]) + ref1 [idx + 1]);
    }

    // global sync
    if (lid == 0){
      atomicSub (lock, 1);
      while (atomicCAS(lock, 0, 0) != 0);
    }
    __syncthreads();

    // copy-back @ time t
    for (int i = 0; i < numP; ++i) {
      int idx  = gid + i * numT;
      if (idx > 0 && idx < dim - 1)
        ref1 [idx]  = ref2 [idx];
    }

    // global sync
    if (lid == 0){
      atomicAdd (lock, 1);
      while (atomicCAS(lock, numB, numB) != numB);
    }
    __syncthreads();
  }
}

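Looking at the output sent back to the CPU, I noticed that one thread (either the first or the last) escapes the barrier and resumes execution earlier than the others. I am using CUDA 5.0. The number of blocks is also always smaller than the number of SMs (in my set of runs).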

Any idea why the same code does not work on both architectures? What is new in Kepler that helps with global synchronization?

Answer 1

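I suspect the barrier code itself is working the same way on both devices. The problem seems to be what happens to the other data structures, the ones not associated with the barrier functionality itself.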

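Neither Kepler nor Fermi has L1 caches that are coherent with each other. What you have discovered (although it is not related to your barrier code itself) is that L1 cache behavior differs between Kepler and Fermi.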

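In particular, the Kepler L1 cache is not in play on global loads, as described in the link above, so caching is handled at the L2 level, which is device-wide and therefore coherent. When a Kepler SMX reads its global data, it gets coherent values from L2.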

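Fermi, on the other hand, has L1 caches that also participate in global loads (by default, although this behavior can be turned off). As described in the link above, the L1 cache is unique to each Fermi SM and is not coherent with the L1 caches in other SMs. When a Fermi SM reads its global data, it takes the values from its L1, which may be stale relative to what other SMs have written and may disagree with their L1 caches.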

This is the difference in "coherency" that you are seeing in the data you manipulate before and after the barrier.

As I mentioned, I believe the barrier code itself is probably working the same way on both devices.
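To make the L1 point concrete, here is a minimal sketch of one way to keep the stencil data coherent across SMs. It is not the asker's code: the names `gridBarrier`, `stencilKernel`, and `runStencil` are hypothetical, and the barrier is swapped for a monotonically increasing counter instead of the question's lock that counts down and back up. The sketch assumes that all blocks are resident on the device at the same time, and that `volatile` accesses to global memory are not served from a stale L1 line on Fermi, which is the commonly described behavior but is worth verifying on your hardware. Error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical helper (not from the question): grid-wide barrier on a counter
// that only ever increases. Assumes every block of the grid is co-resident.
__device__ void gridBarrier(unsigned int* counter, unsigned int target) {
  __threadfence();    // order this thread's earlier global writes device-wide
  __syncthreads();    // every thread of the block has finished its work
  if (threadIdx.x == 0) {
    atomicAdd(counter, 1u);                      // announce this block's arrival
    while (atomicAdd(counter, 0u) < target) { }  // spin until all blocks arrived
  }
  __syncthreads();    // release the rest of the block
}

// Same stencil as in the question; the data pointers are volatile so that
// each access compiles to a real memory access (assumption: not a stale L1 hit).
__global__ void stencilKernel(volatile float* ref1, volatile float* ref2,
                              unsigned int* counter, int time, int dim) {
  int gid  = blockIdx.x * blockDim.x + threadIdx.x;
  int numT = blockDim.x * gridDim.x;
  int numP = dim / numT;
  unsigned int numB  = gridDim.x;
  unsigned int phase = 0;

  for (int t = 0; t < time; ++t) {
    for (int i = 0; i < numP; ++i) {             // compute @ time t
      int idx = gid + i * numT;
      if (idx > 0 && idx < dim - 1)
        ref2[idx] = 0.333f * ((ref1[idx - 1] + ref1[idx]) + ref1[idx + 1]);
    }
    gridBarrier(counter, ++phase * numB);

    for (int i = 0; i < numP; ++i) {             // copy-back @ time t
      int idx = gid + i * numT;
      if (idx > 0 && idx < dim - 1)
        ref1[idx] = ref2[idx];
    }
    gridBarrier(counter, ++phase * numB);
  }
}

// Hypothetical host wrapper: runs `time` steps in place on h[0..dim).
void runStencil(float* h, int dim, int time, int blocks, int threads) {
  assert(h != nullptr && dim > 0 && blocks > 0 && threads > 0);
  float *d1 = nullptr, *d2 = nullptr;
  unsigned int* dCounter = nullptr;
  size_t bytes = dim * sizeof(float);
  cudaMalloc(&d1, bytes);
  cudaMalloc(&d2, bytes);
  cudaMalloc(&dCounter, sizeof(unsigned int));
  cudaMemcpy(d1, h, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d2, h, bytes, cudaMemcpyHostToDevice);
  cudaMemset(dCounter, 0, sizeof(unsigned int)); // counter starts at zero
  stencilKernel<<<blocks, threads>>>(d1, d2, dCounter, time, dim);
  cudaMemcpy(h, d1, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d1);
  cudaFree(d2);
  cudaFree(dCounter);
}
```

A call such as `runStencil(h, 256, 2, 2, 64)` runs two time steps with two blocks of 64 threads. If you would rather leave the kernel untouched, the "can be turned off" remark above corresponds, as far as I know, to compiling with `nvcc -Xptxas -dlcm=cg`, which makes global loads on Fermi cache in L2 only.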

CUDA Global Barrier - works on Kepler but not on Fermi
