Question

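This is only a thought experiment, but I would like to check my understanding of the CUDA execution model. Consider the following situation: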

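I am running on a GPU with poor double-precision performance (a non-Tesla card).
I have a kernel that needs to compute a value using double precision. That value is constant for the remainder of the kernel's run time, and it is also constant across a warp.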

Would something like the following pseudocode be advantageous?

// value that we use later in the kernel; this is constant across all threads
// in a warp
int constant_value;
// check to see if this is the first thread in a warp
enum { warp_size = 32 };
if (!(threadIdx.x & (warp_size - 1)))
{
    // only do the double-precision math in one thread
    constant_value = (int) round(double_precision_calculation());
}
// broadcast constant_value to all threads in the warp
constant_value = __shfl(constant_value, 0);
// go on to use constant_value as needed later in the kernel

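The reason I am considering this is my (possibly wrong) understanding of how double-precision resources are provided on each multiprocessor. As far as I know, recent GeForce cards have only 1/32 as many double-precision ALUs as single-precision ALUs. Does this mean that if the other threads in the warp diverge, I can work around this lack of resources and still get decent performance, as long as the double-precision value I want can be broadcast to all threads in the warp?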

Answer 1

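"Does this mean that if the other threads in the warp diverge, I can work around this lack of resources and still get decent performance, as long as the double-precision value I want can be broadcast to all threads in the warp?"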

No, you can't.

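Instruction issue always happens at the warp level, even in a warp-divergent scenario. Because the instruction is issued at the warp level, it requires/uses/schedules enough execution resources for the whole warp, even for the inactive threads.

A computation done on only one thread therefore still uses the same resources/scheduling slots as a computation done on all 32 threads in the warp.

For example, a floating-point multiply requires 32 instances of floating-point ALU usage. The exact scheduling varies with the specific GPU, but you cannot reduce that 32-instance usage to a lower number through warp divergence or any other mechanism.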
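To make the pattern under discussion concrete, here is a minimal compilable sketch of the question's pseudocode. It assumes CUDA 9 or later, where `__shfl_sync` replaces the older `__shfl`, and it uses a hypothetical stand-in for `double_precision_calculation()`; the kernel name and the `warp_id` parameter are also illustrative. It shows that the broadcast is functionally correct, not that it saves anything: the double-precision instructions inside the branch are still issued for the whole warp.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical stand-in for the question's double_precision_calculation().
__device__ double double_precision_calculation(int warp_id) {
    return sqrt(2.0) * (warp_id + 1);
}

__global__ void broadcast_from_lane0(int *out) {
    const unsigned full_mask = 0xffffffffu;   // all 32 lanes take part in the shuffle
    int lane = threadIdx.x & (warpSize - 1);  // lane index within the warp
    int warp_id = threadIdx.x / warpSize;     // warp index within the block
    int constant_value = 0;

    if (lane == 0) {
        // Only lane 0 is active here, but each double-precision instruction
        // is still issued at warp level, with the other 31 lanes masked off.
        constant_value = (int) round(double_precision_calculation(warp_id));
    }

    // Broadcast lane 0's value to every lane in the warp (assumes CUDA 9+).
    constant_value = __shfl_sync(full_mask, constant_value, 0);
    out[blockIdx.x * blockDim.x + threadIdx.x] = constant_value;
}

int main() {
    const int threads = 64;  // two full warps
    int h_out[threads];
    int *d_out = nullptr;

    cudaMalloc(&d_out, threads * sizeof(int));
    broadcast_from_lane0<<<1, threads>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaError_t res = cudaGetLastError();
    if (res != cudaSuccess) printf("CUDA runtime failure %s\n", cudaGetErrorString(res));

    // Every thread in a warp should report the value computed by its lane 0.
    printf("warp 0: %d %d, warp 1: %d %d\n", h_out[0], h_out[31], h_out[32], h_out[63]);
    cudaFree(d_out);
    return 0;
}
```

The block size is kept a multiple of 32 so that the full-lane mask passed to `__shfl_sync` names only threads that actually exist.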

Based on a question in the comments, here is a worked example on CUDA 7.5, Fedora 20, GT640 (GK208, which has a 1/24 ratio of DP to SP units):

$ cat t1241.cu
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const int nTPB = 32;
const int nBLK = 1;
const int rows = 1048576;
const int nSD = 128;

typedef double mytype;
template <bool use_warp>
__global__ void mpy_k(const mytype * in, mytype * out){
  __shared__ mytype sdata[nTPB*nSD];
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  mytype accum = in[idx];
#pragma unroll 128
  for (int i = 0; i < rows; i++)
    if (use_warp)
      accum += accum*sdata[threadIdx.x+(i&(nSD-1))*nTPB];
    else
      if (threadIdx.x == 0)
        accum += accum*sdata[threadIdx.x+(i&(nSD-1))*nTPB];
  out[idx] = accum;
}

int main(){
  mytype *din, *dout;
  cudaMalloc(&din, nTPB*nBLK*rows*sizeof(mytype));
  cudaMalloc(&dout, nTPB*nBLK*sizeof(mytype));
  cudaMemset(din, 0, nTPB*nBLK*rows*sizeof(mytype));
  cudaMemset(dout, 0, nTPB*nBLK*sizeof(mytype));
  mpy_k<true><<<nBLK, nTPB>>>(din, dout); // warm-up
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  mpy_k<true><<<nBLK, nTPB>>>(din, dout);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  printf("full warp elapsed time: %f\n", dt/(float)USECPSEC);
  mpy_k<false><<<nBLK, nTPB>>>(din, dout); //warm up
  cudaDeviceSynchronize();
  dt = dtime_usec(0);
  mpy_k<false><<<nBLK, nTPB>>>(din, dout);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  printf("one thread elapsed time: %f\n", dt/(float)USECPSEC);
  cudaError_t res = cudaGetLastError();
  if (res != cudaSuccess) printf("CUDA runtime failure %s\n", cudaGetErrorString(res));
  return 0;
}

$ nvcc -arch=sm_35 -o t1241 t1241.cu
$ CUDA_VISIBLE_DEVICES="1" ./t1241
full warp elapsed time: 0.034346
one thread elapsed time: 0.049174
$

Using only one thread in the warp to do the floating-point multiply is not faster.
