Question

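In my specific problem I need to evaluate a function of an argument x inside a kernel by means of an interpolation formula that uses data held in an array in host memory. The argument of the function only becomes available to each kernel instance while the kernel is executing. As soon as the argument is available, I would like to pause the kernel and move the relevant chunk of data (a double8, to be specific) asynchronously from the host to whichever memory region on the device gives the best performance (I think that would be local memory). I would like to know whether memory objects can be used to achieve this, and how to use them efficiently. I am thinking about the following OpenCL constructs: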

cl_mem clCreateBuffer( cl_context   ctx,
                       cl_mem_flags flags,
                       size_t       size,
                       void*        host_ptr,
                       cl_int*      errcode_ret);

err = clEnqueueWriteBuffer(
     command_queue,    // command queue managing the transaction
     output,           // buffer object to write to, could it be in local memory?
     CL_TRUE,          // indicating a blocking transfer
     0,                // offset in the output to start writing the data
     size,             // size of the data transfer
     host_ptr,         // pointer to the buffer in host memory holding the data
     0,                // number of events in the wait list, what could I do with this?
     NULL,             // wait list of events that must complete first; its length is the previous argument, I guess?
     NULL              // event object returned to track completion of this write
     );

Following the OpenCL Best Practices Guide, I would like to use the recommended pattern.

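1) Declare cl_mem buffer objects for the pinned host memory and for the GPU device GMEM, respectively, and a standard pointer to reference the pinned host memory.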

cl_context cxGPUContext;         // computational context for the current arena
cl_mem cmPinnedBufIn = NULL;     // memory buffer on the host-side 
cl_mem cmDevBufIn = NULL;        // memory buffer on the device-side 
cl_double8* cDataIn = NULL;      // host pointer to the mapped pinned buffer (cl_double8 is the host-side double8)

2) Allocate cl_mem buffer objects for the pinned host memory and for the GPU device GMEM, respectively, to prepare for the transaction:

cmPinnedBufIn = clCreateBuffer(cxGPUContext,                              // computational context for the current arena
                               CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,  // zero-copy I guess??
                               memSize,                                   // size of the stack of memory to hold the data transfers
                               NULL,                                      // initializing the host pointer to NULL
                               NULL);                                     // error code

cmDevBufIn = clCreateBuffer(cxGPUContext,     // computational context for the current arena
                            CL_MEM_READ_ONLY, // read-only mode in the device side 
                            memSize,          // size of the stack of memory to hold the transaction, it is in the Global memory I guess? 
                            NULL,             // initializing the device pointer to NULL
                            NULL);            // error code

3) Map the standard pointer so that it references the pinned host memory input and output buffers.

cDataIn = (cl_double8 *)clEnqueueMapBuffer(cqCommandQue,  // command queue to manage the data transfer
                                           cmPinnedBufIn, // pinned buffer instantiated above
                                           CL_TRUE,       // is this a blocking map?
                                           CL_MAP_WRITE,  // what we are doing?
                                           0,             // offset in the data
                                           memSize,       // size of the transfer
                                           0,             // number of events in the waiting list, how to capitalize on this?
                                           NULL,          // waiting list
                                           NULL,          // event
                                           NULL);         // error code

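4) Initialize or update the pinned memory content using the standard host pointer and standard host code. Here I define a function that fetches from the interpolation table the double8 corresponding to the argument x: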

cDataIn = get_data(x);
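
Assigning the result of get_data(x) to cDataIn would replace the mapped pointer rather than fill the pinned buffer it refers to. A minimal sketch of step 4 that copies into the buffer instead is given here. The table layout (uniformly spaced blocks of 8 doubles), the struct interp_table, and the helper names block_index and get_data_into are all hypothetical and not part of the question:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_LEN 8  /* one double8 worth of table entries */

/* Hypothetical table layout (an assumption): n_blocks consecutive blocks
 * of 8 doubles, block i covering x in [x_min + i*dx, x_min + (i+1)*dx). */
typedef struct {
    const double *data;   /* n_blocks * BLOCK_LEN doubles */
    size_t        n_blocks;
    double        x_min;
    double        dx;
} interp_table;

/* Index of the block that contains x, clamped to the ends of the table. */
static size_t block_index(const interp_table *t, double x)
{
    assert(t->n_blocks > 0 && t->dx > 0.0);
    if (!(x > t->x_min))                 /* also catches NaN */
        return 0;
    double q = (x - t->x_min) / t->dx;
    if (q >= (double)t->n_blocks)
        return t->n_blocks - 1;
    return (size_t)q;
}

/* Copy the block for x into dst, e.g. into the mapped pinned memory,
 * leaving the pointer itself untouched. */
static void get_data_into(double *dst, const interp_table *t, double x)
{
    memcpy(dst, t->data + BLOCK_LEN * block_index(t, x),
           BLOCK_LEN * sizeof(double));
}
```

With the mapped pointer from step 3, the call could look like get_data_into((double *)cDataIn, &table, x), after which the non-blocking write in step 5 would send those 8 doubles to the device.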

5) Write the data from pinned host memory to the GPU device GMEM whenever "new" data has been written to the pinned host memory by the application.

err = clEnqueueWriteBuffer(
     cqCommandQue,       // command queue managing the transaction
     cmDevBufIn,         // buffer object to write to, could it be in local memory?
     CL_FALSE,           // indicating that this is not a blocking transfer
     0,                  // offset in the output to start writing the data
     sizeof(cl_double8), // size of the data transfer
     cDataIn,            // pointer to the buffer in host memory holding the data
     0,                  // number of events in the wait list, what could I do with this?
     NULL,               // wait list of events that must complete first; its length is the previous argument, I guess?
     NULL                // event object returned to track completion of this write
     );

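6) Run the computation kernel on the GPU device. This is the step where my application breaks the pattern of more standard computations: the chunk of data that has to be transferred to complete the calculation depends on the value of x computed here. For example, let us say that the kernel is similar to the one in this question: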

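//d_kernel.cl

#pragma OPENCL EXTENSION cl_khr_fp64 : enable  // double types need fp64 support on OpenCL 1.x devices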

__kernel void distance_kernel(__global double *pixelInfo,
                                __global double *clusterCentres,
                                __global double *distanceFromClusterCentre)
{
    int index = get_global_id(0);

    double dl, da, db, dx, dy;  // doubles, so the differences below are not truncated to int

    dl = pixelInfo[5 * index] - clusterCentres[0];
    dl = dl * dl;

    da = pixelInfo[5 * index + 1] - clusterCentres[1];
    da = da * da;

    db = pixelInfo[5 * index + 2] - clusterCentres[2];
    db = db * db;

    dx = pixelInfo[5 * index + 3] - clusterCentres[3];
    dx = dx * dx;

    dy = pixelInfo[5 * index + 4] - clusterCentres[4];
    dy = dy * dy;

    double x = dx + dy + dl + da + db;
    // how could I grab from the host the data corresponding to
    // the value taken by x above?
    // let us suppose that the inlined function get_data(x) does it
    // 'transparently' :)
    double8 point = get_data(x);
    // use point to compute y by interpolation
    double y = interp(x, point);
    distanceFromClusterCentre[index] = y;

}
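
To make explicit what value of x each work-item ends up with, here is a host-side reference in plain C, offered only as a sketch for checking the kernel. It assumes the differences are kept in double (no truncation to int) and that each pixel occupies five consecutive doubles, as in the kernel's indexing. The name squared_distance and the constant N_COMPONENTS are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define N_COMPONENTS 5  /* five values per pixel, matching pixelInfo[5 * index + k] */

/* Host-side reference for the x computed by the work-item with the given
 * index: the sum of squared differences between the pixel's five
 * components and the five cluster-centre components. */
static double squared_distance(const double *pixelInfo,
                               const double *clusterCentres,
                               size_t index)
{
    assert(pixelInfo != NULL && clusterCentres != NULL);
    double sum = 0.0;
    for (size_t k = 0; k < N_COMPONENTS; ++k) {
        double d = pixelInfo[N_COMPONENTS * index + k] - clusterCentres[k];
        sum += d * d;
    }
    return sum;
}
```

Because x is only known after this arithmetic has run on the device, the block of the table it selects cannot be chosen on the host before the kernel is launched, which is the difficulty described in step 6.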

I just want to make the Host-to-Device transfer as efficient as possible. As the Best Practices Guide says:

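With a non-blocking read or write transfer, control is returned immediately to the host thread, allowing operations to occur concurrently in the host thread while the device continues to run.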

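Naively, I believe this is the best way to approach the problem, isn't it? My main concern is how to implement the routine get_data(x), callable from the kernel, so that it manages the Host-to-Device data transfer efficiently. I am also worried because allocating pinned memory can take a considerable amount of time. How can I mitigate this penalty by allocating a stack of memory at the beginning of the execution? What would be the best guess for a good stack size?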

The following statement in the OpenCL specification makes me think that what I am asking for is possible:

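Global Memory. This memory region permits read/write access to all work-items in all work-groups. Work-items can read from or write to any element of a memory object. Reads and writes to global memory may be cached depending on the capabilities of the device.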

But I still do not know how to trigger the data transfer from within the kernel, that is, how get_data(double x) would be implemented.

Are asynchronous data transfers managed by the kernel in OpenCL?

0 Answers: