Question

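I recently started working with CUDA and read an introductory book on the computing language. To check whether I understood it well, I considered the following problem.

Consider minimizing a function f(x,y) over the grid [-1,1] x [-1,1]. This raises some practical questions for me, and I would like to hear what you think.

Do I explicitly compute the grid? If I create the grid on the CPU, then I have to transfer the information to the GPU. I can then use a 2D block layout and access the data efficiently with texture memory. In that case, is it best to use square blocks or blocks of a different shape?
Suppose I don't explicitly make the grid. I could assign the discretizations in the X and Y directions using constant float arrays (which offer fast memory access) and then use a 1D list of blocks.

Thanks!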

Answer 1

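This is an interesting question to me because it represents a type of problem that I consider rare:

potentially high computational load
little or no data that needs to be transferred host -> device
very little result data that needs to be communicated device -> host

In other words, it is nearly all compute, with little dependence on data transfer or even on global memory usage/bandwidth.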

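Having said that, the question seems to be looking for a brute-force search approach to functional optimization/minimization, which is not an efficient technique for functions that are amenable to other optimization methods. As a learning exercise it is interesting anyway. It may also be useful for functions that are otherwise difficult to handle, such as functions with discontinuities or other irregularities.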

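To answer your questions:

"Do I explicitly compute the grid? If I create the grid on the CPU, then I have to transfer the information to the GPU. I can then use a 2D block layout and access the data efficiently with texture memory. In that case, is it best to use square blocks or blocks of a different shape?"

I would not compute the grid on the CPU. (I assume that by "grid" you mean the functional value of f at each point on the grid.) First, this is a compute-intensive task, which is what GPUs are good at. Second, it is potentially a large data set, so transferring it to the GPU (so that the GPU can do the search) takes time. I suggest letting the GPU do this work (computing the functional value at each grid point). Since we won't be using global data access for it, texture memory is not an issue.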
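
To make the "no stored grid" idea concrete, here is a minimal host-side C++ sketch of the index-to-coordinate mapping that each thread would perform. The helper names `grid_coord` and `stored_grid_bytes` are hypothetical and are not part of the sample code in this answer; the arithmetic is a simplified form of what the sample kernel does with `threadIdx`, `blockIdx`, and its scale factors.

```cpp
#include <cassert>

// Hypothetical helper (not in the original answer): coordinate of sample
// number idx out of n evenly spaced samples covering [lo, hi).
// On the GPU, idx would be threadIdx.x + blockDim.x * blockIdx.x.
inline float grid_coord(int idx, int n, float lo, float hi) {
    assert(n > 0 && idx >= 0 && idx < n);
    return lo + (hi - lo) * (static_cast<float>(idx) / static_cast<float>(n));
}

// Bytes needed if every f(x,y) value were stored as a float instead of
// being computed on the fly.
inline unsigned long long stored_grid_bytes(unsigned long long nx,
                                            unsigned long long ny) {
    return nx * ny * sizeof(float);
}
```

With a spacing of 0.0001 over [-1,1] in each direction there are 20000 x 20000 points, so `stored_grid_bytes(20000, 20000)` comes to 1.6 GB for 4-byte floats, which is the kind of transfer the on-the-fly approach avoids.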

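"Suppose I don't explicitly make the grid. I could assign the discretizations in the X and Y directions using constant float arrays (which offer fast memory access) and then use a 1D list of blocks."

Yes, you can use a 1D array (list) of blocks or a 2D array. I don't think the choice has a significant impact on the problem. I think the 2D grid approach fits the problem better (and allows for cleaner code), so I would suggest starting with a 2D array of blocks.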
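
Either layout covers the same set of points; only the index arithmetic differs. The following host-side C++ sketch shows the conversion, using hypothetical names `flatten` and `unflatten` that do not appear in the sample code. With a 1D launch each thread would recover (ix, iy) from its linear index, whereas a 2D launch reads them directly from the x and y components of `threadIdx` and `blockIdx`.

```cpp
#include <cassert>

// Hypothetical 2D index pair (not in the original answer).
struct Index2D { int ix; int iy; };

// Row-major linear index of point (ix, iy) in a grid that is nx points wide.
inline long long flatten(int ix, int iy, int nx) {
    assert(ix >= 0 && ix < nx && iy >= 0);
    return static_cast<long long>(iy) * nx + ix;
}

// Inverse mapping: what a thread in a 1D launch would compute for itself.
inline Index2D unflatten(long long linear, int nx) {
    assert(nx > 0 && linear >= 0);
    return Index2D{ static_cast<int>(linear % nx),
                    static_cast<int>(linear / nx) };
}
```

One practical difference, assuming a 20000 x 20000 domain and 256-thread blocks: a 1D launch needs 1,562,500 blocks, which exceeds the 65535 limit on a grid's x-dimension for compute capability 2.x devices, while a 1250 x 1250 arrangement of 16 x 16 blocks fits.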

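Here is some sample code that may be interesting to play with or useful for crystallizing ideas. Each thread is responsible for computing its own x and y values and then the functional value f at that point. A block-level reduction followed by a block-draining reduction is then used to search all the computed values for the minimum (in this case).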

$ cat t811.cu
#include <stdio.h>
#include <math.h>
#include <assert.h>

// grid dimensions and divisions

#define XNR -1.0f
#define XPR  1.0f
#define YNR -1.0f
#define YPR  1.0f
#define DX   0.0001f
#define DY   0.0001f

// threadblock dimensions - product must be a power of 2
#define BLK_X 16
#define BLK_Y 16

// optimization functions - these are currently set for minimization

#define TST(X1,X2) ((X1)>(X2))
#define OPT(X1,X2) (X2)

// error check macro

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

// for timing
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

// the function f that will be "optimized"
__host__ __device__ float f(float x, float y){
  return (x+0.5)*(x+0.5) + (y+0.5)*(y+0.5) +0.1f;
}


// variable for block-draining reduction block counter
__device__ int blkcnt = 0;

// GPU optimization kernel
__global__ void opt_kernel(float * __restrict__ bf, float * __restrict__ bx, float * __restrict__ by, const float scx, const float scy){

  __shared__ float sh_f[BLK_X*BLK_Y];
  __shared__ float sh_x[BLK_X*BLK_Y];
  __shared__ float sh_y[BLK_X*BLK_Y];
  __shared__ int lblock;

// compute x,y coordinates for this thread
  float x = ((threadIdx.x+blockDim.x*blockIdx.x) * (XPR-XNR))*scx + XNR;
  float y = ((threadIdx.y+blockDim.y*blockIdx.y) * (YPR-YNR))*scy + YNR;

  int thid = (threadIdx.y*BLK_X)+threadIdx.x;
  lblock = 0;
  sh_x[thid] = x;
  sh_y[thid] = y;
  sh_f[thid] = f(x,y);  // compute functional value of f(x,y)
  __syncthreads();

// perform block-level shared memory reduction
  // assume block size is a power of 2
  for (int i = (blockDim.x*blockDim.y)>>1; i > 16; i>>=1){
    if (thid < i)
      if (TST(sh_f[thid],sh_f[thid+i])){
        sh_f[thid] = OPT(sh_f[thid],sh_f[thid+i]);
        sh_x[thid] = OPT(sh_x[thid],sh_x[thid+i]);
        sh_y[thid] = OPT(sh_y[thid],sh_y[thid+i]);}
    __syncthreads();}
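// final warp: relies on implicit warp-synchronous execution (fine on the cc2.0 GPU tested here);
// Volta and later GPUs need explicit warp synchronization, e.g. __syncwarp(), between steps
  volatile float *vf = sh_f;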
  volatile float *vx = sh_x;
  volatile float *vy = sh_y;
  for (int i = 16; i > 0; i>>=1)
    if (thid < i)
      if (TST(vf[thid],vf[thid+i])){
        vf[thid] = OPT(vf[thid],vf[thid+i]);
        vx[thid] = OPT(vx[thid],vx[thid+i]);
        vy[thid] = OPT(vy[thid],vy[thid+i]);}
// save block reduction result, and check if last block
  if (!thid){
    bf[blockIdx.y*gridDim.x+blockIdx.x] = sh_f[0];
    bx[blockIdx.y*gridDim.x+blockIdx.x] = sh_x[0];
    by[blockIdx.y*gridDim.x+blockIdx.x] = sh_y[0];
    __threadfence(); // make this block's results visible to other blocks before counting it as done
    int myblock = atomicAdd(&blkcnt, 1);
    if (myblock == (gridDim.x*gridDim.y-1)) lblock = 1;}
  __syncthreads();
  if (lblock){
    // do last-block reduction
    float my_x, my_y, my_f;
    int myid = thid;
    if (myid < gridDim.x * gridDim.y){
      my_x = bx[myid];
      my_y = by[myid];
      my_f = bf[myid];}
    else { assert(0);} // does not work correctly if block dims are greater than grid dims
    myid += blockDim.x*blockDim.y;
    while (myid < gridDim.x*gridDim.y){
      if (TST(my_f,bf[myid])){
        my_x = OPT(my_x,bx[myid]);
        my_y = OPT(my_y,by[myid]);
        my_f = OPT(my_f,bf[myid]);}
      myid += blockDim.x*blockDim.y;}
    sh_f[thid] = my_f;
    sh_x[thid] = my_x;
    sh_y[thid] = my_y;
    __syncthreads();
    for (int i = (blockDim.x*blockDim.y)>>1; i > 0; i>>=1){
      if (thid < i)
        if (TST(sh_f[thid],sh_f[thid+i])){
          sh_f[thid] = OPT(sh_f[thid],sh_f[thid+i]);
          sh_x[thid] = OPT(sh_x[thid],sh_x[thid+i]);
          sh_y[thid] = OPT(sh_y[thid],sh_y[thid+i]);}
      __syncthreads();}
    if (!thid){
      bf[0] = sh_f[0];
      bx[0] = sh_x[0];
      by[0] = sh_y[0];
      }
    }
}

// cpu (naive,serial) function for comparison

float3 opt_cpu(){
  float optx = XNR;
  float opty = YNR;
  float optf = f(optx,opty);
  for (float x = XNR; x < XPR; x += DX)
    for (float y = YNR; y < YPR; y += DY){
      float test = f(x,y);
      if (TST(optf,test)){
        optf = OPT(optf,test);
        optx = OPT(optx,x);
        opty = OPT(opty,y);}}
  return make_float3(optf, optx, opty);
}

int main(){

// compute threadblock and grid dimensions
  int nx = ceil((XPR-XNR)/DX);  // number of grid points in x (round up after dividing)
  int ny = ceil((YPR-YNR)/DY);  // number of grid points in y
  int bx = ceil(nx/(float)BLK_X);
  int by = ceil(ny/(float)BLK_Y);
  dim3 threads(BLK_X, BLK_Y);
  dim3 blocks(bx, by);
  float *d_bx, *d_by, *d_bf;
  cudaFree(0);
// run GPU test case
  unsigned long gtime = dtime_usec(0);
  cudaMalloc(&d_bx, bx*by*sizeof(float));
  cudaMalloc(&d_by, bx*by*sizeof(float));
  cudaMalloc(&d_bf, bx*by*sizeof(float));
  opt_kernel<<<blocks, threads>>>(d_bf, d_bx, d_by, 1.0f/(blocks.x*threads.x), 1.0f/(blocks.y*threads.y));
  float rf, rx, ry;
  cudaMemcpy(&rf, d_bf, sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(&rx, d_bx, sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(&ry, d_by, sizeof(float), cudaMemcpyDeviceToHost);
  cudaCheckErrors("some error");
  gtime = dtime_usec(gtime);
  printf("gpu val: %f, x: %f, y: %f, time: %fs\n", rf, rx, ry, gtime/(float)USECPSEC);
//run CPU test case
  unsigned long ctime = dtime_usec(0);
  float3 cpu_res = opt_cpu();
  ctime = dtime_usec(ctime);
  printf("cpu val: %f, x: %f, y: %f, time: %fs\n", cpu_res.x, cpu_res.y, cpu_res.z, ctime/(float)USECPSEC);

  return 0;
}
$ nvcc -O3 -o t811 t811.cu
$ ./t811
gpu val: 0.100000, x: -0.500000, y: -0.500000, time: 0.193248s
cpu val: 0.100000, x: -0.500017, y: -0.500017, time: 2.810862s
$
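
The block-level reduction in the kernel above can be hard to follow at first, so here is a minimal host-side C++ emulation of the same pattern. It is only a sketch: the `Sample` struct and the `block_min` function are hypothetical names, the inner loop stands in for the threads of one block, and the hard-coded `>` comparison plays the role of the `TST` macro as set up for minimization.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record (not in the original answer): a function value
// together with the point that produced it.
struct Sample { float f; float x; float y; };

// Emulates the shared-memory tree reduction for one block.
// The number of samples must be a power of 2, as in the kernel.
inline Sample block_min(std::vector<Sample> s) {
    assert(!s.empty() && (s.size() & (s.size() - 1)) == 0);
    for (std::size_t stride = s.size() >> 1; stride > 0; stride >>= 1) {
        // Each value of t plays the role of one thread with thid < stride.
        for (std::size_t t = 0; t < stride; ++t) {
            if (s[t].f > s[t + stride].f) {
                s[t] = s[t + stride];  // keep the smaller value and its x, y
            }
        }
        // In the kernel, __syncthreads() separates the passes here.
    }
    return s[0];
}
```

Each pass halves the number of active slots, so a 256-thread block needs 8 passes. The kernel then repeats the same idea across blocks: every block writes its winner to global memory, and the last block to finish reduces those per-block winners.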

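Notes:

This problem is set up to find the minimum of f(x,y) = (x+0.5)^2 + (y+0.5)^2 + 0.1 over the domain x in (-1,1), y in (-1,1).
The tests were run on Fedora 20, CUDA 7, a Quadro5000 GPU (cc2.0), and a Xeon X5560 2.8GHz CPU. Different CPUs or GPUs will obviously affect the comparison.
The speedup observed here is about 14x. The CPU code is naive, single-threaded code.
It should be possible to perform other kinds of optimization, for example a maximum instead of a minimum, by modifying the OPT and TST macros.
The dimensions and granularity of the domain (and grid) to be searched can be modified through the compile-time constants, e.g. XNR, XPR, etc.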
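
As a sketch of the macro change mentioned in the notes, flipping the comparison in `TST` should turn the search into a maximization while `OPT` stays the same. The helper `keep_best` is hypothetical and exists only to exercise the macros on the host; in the sample code the macros are used directly inside the kernel and `opt_cpu`.

```cpp
#include <cassert>

// Maximization variant: the test is true when the candidate (X2) is larger.
// The sample code uses ((X1)>(X2)) here for minimization.
#define TST(X1,X2) ((X1)<(X2))
// Unchanged: when the test passes, take the candidate.
#define OPT(X1,X2) (X2)

// Hypothetical helper (not in the original answer) showing how the two
// macros combine: returns the better of the current best and a candidate.
inline float keep_best(float current, float candidate) {
    if (TST(current, candidate)) {
        current = OPT(current, candidate);
    }
    return current;
}
```

With these definitions `keep_best(1.0f, 2.0f)` returns 2.0f, and restoring the original `TST` would make the same call return 1.0f.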
