Question

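Disclaimer: I'm not completely lost here, I just need some guidance. I'm working with an image that is stored pixel by pixel in a 2D array. The array is a data member of an Image class. The program works perfectly as a serial program. Anyway...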

class Image{
    int rows;
    int cols;
    int ** pixels; //2D array
};

The pixels are stored in this format: pixels[rows][cols]

I know that I can't access the data member inside a __global__ CUDA function, and that is where I'm stuck. I need to:

1) Access the data member (pixels)
2) Copy everything to Cuda 
3) Do work on it
4) Get it all back
5) Store it back into pixels

So my question is: how do I copy this data to the device and use it in my CUDA function?

Here is the kernel:

__global__ void cuda_negate_image(int ** new_array, int ** old_array, int rows, int cols){

    int i = blockIdx.y*blockDim.y + threadIdx.y;
    int j = blockIdx.x*blockDim.x + threadIdx.x;

    if (i < rows && j < cols) {
        new_array[i][j] = -(old_array[i][j]) + 255;
    }
}

I know how to work with pointers, but not pointers to pointers :(.

Answer 1

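As Robert pointed out in the comments, this is a very common question that comes up frequently, and my rather old answer highlights most of the key points, although it is probably not the canonical example we should have.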

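The really short answer is that you need to first build the array of device pointers in host memory, and then copy that array to the device. Turning your code into a simple example that allocates memory on the device gives something like this: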

class Image{
    public:

    int rows;
    int cols;
    int ** pixels; //2D array

    __host__ __device__
    Image() {};
    __host__ __device__
    Image(int r, int c, int** p) : rows(r), cols(c), pixels(p) {};
};

__global__ void intialiseImage(Image image, const int p_val)
{
    int i = blockIdx.y*blockDim.y + threadIdx.y;
    int j = blockIdx.x*blockDim.x + threadIdx.x;

    if (i < image.rows && j < image.cols) {
        image.pixels[i][j] = p_val;
    }
}

int** makeDeviceImage(Image& dev_image, const int rows, const int cols)
{
    int** h_pixels = new int*[rows];
    for(int i=0; i<rows; i++) {
        cudaMalloc((void **)&h_pixels[i], sizeof(int) * size_t(cols));
    }
    int** d_pixels;
    cudaMalloc((void**)&d_pixels, sizeof(int*) * size_t(rows));
    cudaMemcpy(d_pixels, &h_pixels[0], sizeof(int*) * size_t(rows), cudaMemcpyHostToDevice);

    dev_image = Image(rows, cols, d_pixels);

    return h_pixels;
}


int main(void)
{
    int rows = 16, cols = 32;

    Image dev_image;
    int** dev_pixels = makeDeviceImage(dev_image, rows, cols);

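    // Launch a 2D grid so the kernel's (i, j) indices cover every row and column;
    // a 1D launch such as <<<rows, cols>>> leaves i at 0 and only touches row 0.
    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    intialiseImage<<<grid, block>>>(dev_image, 128);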
    cudaDeviceSynchronize();
    cudaDeviceReset();

    return 0;
}

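I will leave the copying code as an exercise for the reader (hint: the array of pointers returned by the function is very useful there), but there is one comment worth making. Have a look at the profiler output for that code: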

>nvprof a.exe
==5148== NVPROF is profiling process 5148, command: a.exe
==5148== Profiling application: a.exe
==5148== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 75.82%  2.2070us         1  2.2070us  2.2070us  2.2070us  intialiseImage(Image, int)
 24.18%     704ns         1     704ns     704ns     704ns  [CUDA memcpy HtoD]

==5148== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.33%  309.01ms        17  18.177ms  20.099us  308.62ms  cudaMalloc
  0.50%  1.5438ms        83  18.599us     427ns  732.97us  cuDeviceGetAttribute
  0.07%  202.70us         1  202.70us  202.70us  202.70us  cuDeviceGetName
  0.04%  136.84us         1  136.84us  136.84us  136.84us  cudaDeviceSynchronize
  0.03%  92.370us         1  92.370us  92.370us  92.370us  cudaMemcpy
  0.02%  76.974us         1  76.974us  76.974us  76.974us  cudaLaunch
  0.01%  24.375us         1  24.375us  24.375us  24.375us  cuDeviceTotalMem
  0.00%  5.5580us         2  2.7790us  2.5650us  2.9930us  cuDeviceGetCount
  0.00%  4.2760us         1  4.2760us  4.2760us  4.2760us  cudaConfigureCall
  0.00%  3.4220us         2  1.7110us     856ns  2.5660us  cudaSetupArgument
  0.00%  2.5660us         2  1.2830us  1.2830us  1.2830us  cuDeviceGet

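On the platform I tested on (Windows 8, a mobile-class Fermi GPU), the kernel that writes a value into the image takes about 2us. A cudaMalloc call takes at least 20us, and there are 17 cudaMalloc calls to allocate this trivially small array. The overhead of working with arrays of pointers in CUDA is significant, and I would not recommend it if performance is your first priority.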
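One common way to avoid the per-row allocations is to keep the device copy of the image in a single flat, row-major buffer and index it as i * cols + j. The following is only a minimal sketch of that idea, not code from the answer above: the names negate_flat and negate_on_device are hypothetical, the kernel body is the negate operation from the question, and the host image is assumed to still be an int** with rows row pointers of cols ints each.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Kernel over a flat, row-major buffer: element (i, j) lives at i * cols + j.
__global__ void negate_flat(int* out, const int* in, int rows, int cols)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < rows && j < cols) {
        out[i * cols + j] = 255 - in[i * cols + j];
    }
}

// Hypothetical helper: negates pixels[rows][cols] in place using the GPU.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t negate_on_device(int** pixels, int rows, int cols)
{
    if (rows <= 0 || cols <= 0) return cudaSuccess;  // nothing to do
    assert(pixels != nullptr);

    const size_t count = size_t(rows) * size_t(cols);
    const size_t bytes = count * sizeof(int);

    // 1) Flatten the row pointers into one contiguous host buffer.
    std::vector<int> flat(count);
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            flat[size_t(i) * cols + j] = pixels[i][j];

    // 2) Two device allocations in total, independent of the number of rows.
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaError_t err = cudaMalloc((void**)&d_in, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc((void**)&d_out, bytes);
    if (err != cudaSuccess) { cudaFree(d_in); return err; }

    err = cudaMemcpy(d_in, flat.data(), bytes, cudaMemcpyHostToDevice);

    // 3) Do the work with a 2D launch that covers every (i, j).
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((cols + block.x - 1) / block.x,
                  (rows + block.y - 1) / block.y);
        negate_flat<<<grid, block>>>(d_out, d_in, rows, cols);
        err = cudaGetLastError();  // reports launch configuration errors
    }

    // 4) Copy the result back; this blocking copy runs after the kernel
    //    because both use the default stream.
    if (err == cudaSuccess)
        err = cudaMemcpy(flat.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    // 5) Store the result back into the original 2D layout.
    if (err == cudaSuccess)
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                pixels[i][j] = flat[size_t(i) * cols + j];

    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}
```

A member function of Image could call negate_on_device(pixels, rows, cols) directly, which sidesteps the data member access problem because only plain pointers and ints are passed to the kernel. If the host image were itself stored as one contiguous block, the two flattening loops would not be needed at all.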

Using a 2D array data member in CUDA
