OpenCL vs CUDA: Pinned Memory

Date: 2018-06-13 16:08:44

Tags: c++ opencl nvidia

I have been porting my RabbitCT CUDA implementation to OpenCL, and I am running into issues with pinned memory.

For CUDA, a host buffer is created that buffers the input images to be processed in pinned memory. This allows the host to receive the next batch of input images while the GPU processes the current batch. A simplified model of my CUDA implementation looks as follows:

// globals
float** hostProjBuffer = new float*[BUFFER_SIZE];
float* devProjection[STREAMS_MAX];
cudaStream_t stream[STREAMS_MAX];

void initialize()
{
    // initiate streams
    for( uint s = 0; s < STREAMS_MAX; s++ ){
        cudaStreamCreateWithFlags (&stream[s], cudaStreamNonBlocking);
        cudaMalloc( (void**)&devProjection[s], imgSize);
    }

    // initiate buffers
    for( uint b = 0; b < BUFFER_SIZE; b++ ){
        cudaMallocHost((void **)&hostProjBuffer[b], imgSize);
    }
}

// main function called for all input images
void backproject(imgdata* r)
{
    uint projNr = r->imgnr % BUFFER_SIZE;
    uint streamNr = r->imgnr % STREAMS_MAX;

    // When buffer is filled, wait until work in current stream has finished
    if(projNr == 0) {
        cudaStreamSynchronize(stream[streamNr]);
    }       

    // copy received image data to buffer (maps double precision to float)
    std::copy(r->I_n, r->I_n+(imgSizeX * imgSizeY), hostProjBuffer[projNr]);

    // copy image and matrix to device
    cudaMemcpyAsync( devProjection[streamNr], hostProjBuffer[projNr], imgSize, cudaMemcpyHostToDevice, stream[streamNr] );

    // call kernel
    backproject<<<numBlocks, threadsPerBlock, 0 , stream[streamNr]>>>(devProjection[streamNr]);
}

So for CUDA, I create a pinned host pointer for each buffer item and copy the data to the device before executing the kernel of each stream.
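The simplified model above omits the teardown; a rough sketch of the matching cleanup, assuming the same globals as in the snippet, would be:

// sketch only: release counterparts of the allocations made in initialize()
void cleanup()
{
    // free the pinned host buffers
    for( uint b = 0; b < BUFFER_SIZE; b++ ){
        cudaFreeHost(hostProjBuffer[b]);
    }
    delete[] hostProjBuffer;

    // free the device buffers and destroy the streams
    for( uint s = 0; s < STREAMS_MAX; s++ ){
        cudaFree(devProjection[s]);
        cudaStreamDestroy(stream[s]);
    }
}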

For OpenCL, I initially did something similar, following the Nvidia OpenCL Best Practices Guide. There they recommend creating two buffers, one to copy the kernel data to and one for the pinned memory. However, this leads to the implementation using double the device memory, as both the kernel buffer and the pinned memory buffer are allocated on the device.
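For reference, a minimal sketch of that two-buffer pattern, as I understand it from the guide (the names pinnedProj, deviceProj and mappedProj are only illustrative, not from my actual code), looks roughly like this:

// pinned host-side buffer and separate device-side buffer
cl_mem pinnedProj = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, imgSize, NULL, &status);
cl_mem deviceProj = clCreateBuffer(context, CL_MEM_READ_ONLY, imgSize, NULL, &status);

// map the pinned buffer once to obtain a host pointer
float* mappedProj = (float*) clEnqueueMapBuffer(queue[0], pinnedProj, CL_TRUE, CL_MAP_WRITE, 0, imgSize, 0, NULL, NULL, &status);

// per image: fill the pinned host memory, then copy it into the device buffer
std::copy(r->I_n, r->I_n + (imgSizeX * imgSizeY), mappedProj);
clEnqueueWriteBuffer(queue[0], deviceProj, CL_FALSE, 0, imgSize, mappedProj, 0, NULL, NULL);

Since both pinnedProj and deviceProj exist at the same time, this is where the doubled memory usage comes from.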

To get around this memory issue, I created an implementation that only maps a buffer to the device when it is needed. This can be seen in the following implementation:

// globals
float** hostProjBuffer = new float* [BUFFER_SIZE];
cl_mem devProjection[STREAMS_MAX], devMatrix[STREAMS_MAX];
cl_command_queue queue[STREAMS_MAX];

// initiate streams
void initialize()
{
    for( uint s = 0; s < STREAMS_MAX; s++ ){
        queue[s] = clCreateCommandQueueWithProperties(context, device, NULL, &status);
        devProjection[s] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, imgSize, NULL, &status);
    }
}

// main function called for all input images
void backproject(imgdata* r)
{
    const uint projNr = r->imgnr % BUFFER_SIZE;
    const uint streamNr = r->imgnr % STREAMS_MAX;

    // when buffer is filled, wait until work in current stream has finished
    if(projNr == 0) {
       status = clFinish(queue[streamNr]);
    }

    // map host memory region to device buffer
    hostProjBuffer[projNr] = (float*) clEnqueueMapBuffer(queue[streamNr], devProjection[streamNr], CL_FALSE, CL_MAP_WRITE_INVALIDATE_REGION, 0, imgSize, 0, NULL, NULL, &status);

    // copy received image data to hostbuffers
    std::copy(r->I_n, r->I_n + (imgSizeX * imgSizeY), hostProjBuffer[projNr]);

    // unmap the allocated pinned host memory
    clEnqueueUnmapMemObject(queue[streamNr], devProjection[streamNr], hostProjBuffer[projNr], 0, NULL, NULL);   

    // set stream specific arguments
    clSetKernelArg(kernel, 0, sizeof(devProjection[streamNr]), (void *) &devProjection[streamNr]);

    // launch kernel
    clEnqueueNDRangeKernel(queue[streamNr], kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, NULL);

    clFlush(queue[streamNr]);
    clFinish(queue[streamNr]);   //should be removed!
}

This implementation does use an amount of device memory similar to the CUDA implementation. However, I have not been able to get the last code sample working without the clFinish after each loop, which significantly hampers the performance of the application. This suggests data is lost when the host runs ahead of the kernel. I tried increasing the buffer size to the number of input images, but that did not work either. So somewhere during execution, the hostProjBuffer data gets lost.
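For clarity, the map call in my loop is non-blocking (CL_FALSE); the blocking variant, shown here only for comparison, does not return until the mapped region is safe to write to:

// blocking map (CL_TRUE): the call returns only after the region is mapped,
// so the pointer is guaranteed to be valid before the std::copy
hostProjBuffer[projNr] = (float*) clEnqueueMapBuffer(queue[streamNr], devProjection[streamNr], CL_TRUE, CL_MAP_WRITE_INVALIDATE_REGION, 0, imgSize, 0, NULL, NULL, &status);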

So, in order to write OpenCL code that works like the CUDA version, I have three questions:

  1. What is the recommended implementation for pinned memory in OpenCL?
  2. Is my OpenCL implementation similar to how CUDA handles pinned memory?
  3. What causes the wrong data to be used in the OpenCL example?

Thanks in advance!

Kind regards,

Remy


PS: This question was originally asked on the Nvidia developer forums.

0 Answers:

No answers yet.