I have the following question and would like to know which approach is best:
__kernel void test(__global int* output){
    // ... Code execution to define myValue.

    // V1: the other work-items are idle and wait while the first
    // work-item writes the output value.
    if(get_global_id(0) == 0) output[0] = myValue;

    // V2: all work-items perform the same action and try to write
    // to the same global memory location. (Is there any lock?)
    output[0] = myValue;
}
Both run on my AMD GPU, but I don't know which is the better approach.
Edit:
Following kanna's answer, I have added more code below for context (it is a work in progress and will be updated over time).
My goal is to keep track of head / next_head per kernel and keep the memory-block pointers consistent across work-groups.
In my first approach I incremented the head directly in global memory, which broke down once the number of work-groups grew: the block positions went out of sync. With the code below everything seems to run as expected, and every work-group accesses the same block pointers; the later processing then indexes into the blocks with get_global_id(0).
So I am looking for OpenCL good practices to improve this code and make sure I will not hit any bottlenecks in the future. If you see any, feel free to make suggestions about the code below.
__global void* malloc(size_t sizePtr, __global uchar* heap, ulong* head){
    // Return a pointer to the next free byte of the heap.
    __global void* ptr = heap + head[0];
    // Advance the (private) head past the reserved block.
    head[0] = head[0] + sizePtr;
    return ptr;
}
__kernel void test(__global uchar* heap,
                   __global ulong* head,
                   __global ulong* next){
    // Each work-item copies the global head into its own private
    // variable, so every work-item in every work-group starts
    // allocating at the same offset into the heap.
    ulong local_head = head[0];

    // If get_global_size(0) is 1000, we allocate 1000 + 4000 bytes.
    const uint g_size = get_global_size(0);

    // Carve pointers out of one huge memory block (the heap), which
    // reduces memory transfers between kernels.
    // We just need to keep track of them (work in progress).
    __global uchar* block1 = malloc(sizeof(uchar) * g_size, heap, &local_head);
    __global int*   block2 = malloc(sizeof(int)   * g_size, heap, &local_head);

    // Process the blocks here, using get_global_id(0) as the index.

    // V1
    if(get_global_id(0) == 0) next[0] = local_head;
    // V2
    next[0] = local_head;

    // If head was 0, next is now 5000 for all work-items,
    // whatever work-group they are in.
}
Answer 0 (score: 1)
On warp-based GPUs, V1 is definitely better.
The advantage of V1 is that all the other warps terminate early, which reduces memory traffic.
There are no locks in OpenCL, and you cannot reliably build your own with atomic operations either: because work-items in a warp execute in lockstep, a work-item spinning on a lock can keep the lock holder from making progress, so such locks tend to deadlock.