Question

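I am looking for optimization strategies for my CUDA program. In each iteration of the for loop inside my kernel, every thread produces a score. I maintain a shared priority queue of scores to keep the top k per block. Please see the pseudocode below: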

__global__ void gpuCompute(/* ... arguments */)
{
    // k and THREADS_IN_BLOCK stand for compile-time constants (k < THREADS_IN_BLOCK)
    __shared__ int myPriorityQueue[k];                    // top k scores of this block
    __shared__ int scoresToInsertInQueue[THREADS_IN_BLOCK];
    __shared__ int counter;                               // must be zeroed before first use
    for (...)       // about 1-100 million iterations
    {
        int score = calculate_score(...);
        if (score > /* minimum element of the priority queue */ && ...)
        {
            int localCounter = atomicAdd(&counter, 1);    // atomic: reserve a slot
            scoresToInsertInQueue[localCounter] = score;
        }
        __syncthreads();
        // Merge scores from scoresToInsertInQueue into myPriorityQueue
        while (counter > 0)
        {
            // Parallel insertion of scoresToInsertInQueue[counter - 1] into
            // myPriorityQueue, with k threads of this block participating
            __syncthreads();
            if (threadIdx.x == 0) counter--;              // a single thread updates the shared counter
            __syncthreads();
        }
        __syncthreads();
    }
}

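I hope the code above makes sense. Now I am looking for a way to remove the atomic operation overhead, such that each thread saves a '1' or a '0' depending on whether its value should go into the priority queue or not. I am wondering whether there is an implementation of stream compaction that works within a kernel, so that I can reduce a '1000000000100000000' buffer to '11000000000000000000' (or learn the indices of the '1's) and finally insert the scores corresponding to the '1's into the queue.
Note that the '1's will be very sparse in this case.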

Answer 1

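If the ones are very sparse, the atomic method may be the fastest. However, the method I describe here will have more predictable and bounded worst-case performance.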
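For comparison, here is a minimal standalone sketch of the atomic approach, assuming a single block with one score per thread. The names `atomicCompact`, `runAtomicCompact`, `BLOCK_THREADS` and the `threshold` parameter are hypothetical and not part of the original post; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK_THREADS = 128;  // hypothetical block size for this sketch

// Each selected thread reserves an output slot with atomicAdd on a shared counter.
__global__ void atomicCompact(const int* scores, int n, int threshold,
                              int* out, int* outCount)
{
    __shared__ int selected[BLOCK_THREADS];
    __shared__ int counter;
    int tid = threadIdx.x;

    if (tid == 0) counter = 0;
    __syncthreads();

    if (tid < n && scores[tid] > threshold) {
        int slot = atomicAdd(&counter, 1);  // returns the value before the increment
        selected[slot] = scores[tid];
    }
    __syncthreads();

    // Copy the compacted scores out; their order depends on thread scheduling.
    if (tid < counter) out[tid] = selected[tid];
    if (tid == 0) *outCount = counter;
}

// Host helper (assumes 1 <= scores.size() <= BLOCK_THREADS).
std::vector<int> runAtomicCompact(const std::vector<int>& scores, int threshold)
{
    int n = static_cast<int>(scores.size());
    int *dScores, *dOut, *dCount, count = 0;
    cudaMalloc(&dScores, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMalloc(&dCount, sizeof(int));
    cudaMemcpy(dScores, scores.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    atomicCompact<<<1, BLOCK_THREADS>>>(dScores, n, threshold, dOut, dCount);

    cudaMemcpy(&count, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<int> result(count);
    if (count > 0)
        cudaMemcpy(result.data(), dOut, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dScores); cudaFree(dOut); cudaFree(dCount);
    return result;
}
```

With few selected scores, only a handful of threads contend for the counter, which is why this variant can win in the sparse case. The cost is that the output order is not deterministic.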

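For a good mix of ones and zeros in the decision array, it may be faster to use a parallel scan (prefix sum) to build an array of insertion-point indices from the decision array: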

Suppose I have a thresholding decision that selects scores > 30 to go into the queue. My data might look like this:

scores:     30  32  28  77  55  12  19
score > 30:  0   1   0   1   1   0   0
insert_pt:   0   0   1   1   2   3   3    (an "exclusive prefix sum")

Then each thread makes its storage choice as follows:

if (score[threadIdx.x] > 30) temp[threadIdx.x] = 1;
else temp[threadIdx.x] = 0;
__syncthreads();
// perform exclusive scan on temp array into insert_pt array
__syncthreads();
if (temp[threadIdx.x] == 1)
  myPriorityQueue[insert_pt[threadIdx.x]] = score[threadIdx.x];
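The snippet above leaves the scan itself as a comment. As a rough illustration, the following self-contained sketch fills it in with a simple Hillis-Steele scan in shared memory, assuming one block and one score per thread. The names `scanCompact`, `runScanCompact`, `BLOCK_THREADS` and the `threshold` parameter are hypothetical, the output goes to a plain array rather than a real priority queue, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK_THREADS = 128;  // hypothetical block size; launch with exactly this many threads

__global__ void scanCompact(const int* scores, int n, int threshold,
                            int* out, int* outCount)
{
    __shared__ int scan[BLOCK_THREADS];
    int tid = threadIdx.x;
    int score = (tid < n) ? scores[tid] : 0;
    int flag = (tid < n && score > threshold) ? 1 : 0;  // the decision array entry

    scan[tid] = flag;
    __syncthreads();

    // Hillis-Steele inclusive scan: read, sync, then write, to avoid races.
    for (int offset = 1; offset < BLOCK_THREADS; offset <<= 1) {
        int addend = (tid >= offset) ? scan[tid - offset] : 0;
        __syncthreads();
        scan[tid] += addend;
        __syncthreads();
    }

    // Exclusive prefix sum = inclusive sum minus this thread's own flag.
    int insertPt = scan[tid] - flag;
    if (flag) out[insertPt] = score;
    if (tid == BLOCK_THREADS - 1) *outCount = scan[tid];  // total number selected
}

// Host helper (assumes 1 <= scores.size() <= BLOCK_THREADS).
std::vector<int> runScanCompact(const std::vector<int>& scores, int threshold)
{
    int n = static_cast<int>(scores.size());
    int *dScores, *dOut, *dCount, count = 0;
    cudaMalloc(&dScores, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMalloc(&dCount, sizeof(int));
    cudaMemcpy(dScores, scores.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    scanCompact<<<1, BLOCK_THREADS>>>(dScores, n, threshold, dOut, dCount);

    cudaMemcpy(&count, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<int> result(count);
    if (count > 0)
        cudaMemcpy(result.data(), dOut, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dScores); cudaFree(dOut); cudaFree(dCount);
    return result;
}
```

Calling `runScanCompact({30, 32, 28, 77, 55, 12, 19}, 30)` on the example data should yield 32, 77, 55 in that order, since a scan-based compaction preserves the original ordering of the selected scores. The scan costs O(log n) synchronized steps regardless of how many scores are selected, which is where the bounded worst case comes from.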

CUB has a fast parallel prefix scan.
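As a hedged sketch of how that could look inside a kernel, the example below uses `cub::BlockScan` with its `ExclusiveSum` overload that also returns the block-wide total. It assumes one block and one score per thread; `cubCompact`, `runCubCompact`, `BLOCK_THREADS` and the `threshold` parameter are hypothetical names, and error checking is omitted.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK_THREADS = 128;  // hypothetical block size; must match the launch configuration

__global__ void cubCompact(const int* scores, int n, int threshold,
                           int* out, int* outCount)
{
    using BlockScan = cub::BlockScan<int, BLOCK_THREADS>;
    __shared__ typename BlockScan::TempStorage tempStorage;

    int tid = threadIdx.x;
    int score = (tid < n) ? scores[tid] : 0;
    int flag = (tid < n && score > threshold) ? 1 : 0;

    // insertPt receives the exclusive prefix sum of the flags,
    // total receives the number of selected scores in the block.
    int insertPt = 0, total = 0;
    BlockScan(tempStorage).ExclusiveSum(flag, insertPt, total);

    if (flag) out[insertPt] = score;
    if (tid == 0) *outCount = total;
}

// Host helper (assumes 1 <= scores.size() <= BLOCK_THREADS).
std::vector<int> runCubCompact(const std::vector<int>& scores, int threshold)
{
    int n = static_cast<int>(scores.size());
    int *dScores, *dOut, *dCount, count = 0;
    cudaMalloc(&dScores, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMalloc(&dCount, sizeof(int));
    cudaMemcpy(dScores, scores.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    cubCompact<<<1, BLOCK_THREADS>>>(dScores, n, threshold, dOut, dCount);

    cudaMemcpy(&count, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<int> result(count);
    if (count > 0)
        cudaMemcpy(result.data(), dOut, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dScores); cudaFree(dOut); cudaFree(dCount);
    return result;
}
```

If the scan is called repeatedly inside a loop, as in the question's kernel, a `__syncthreads()` is needed between calls before `tempStorage` is reused.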

Stream compaction within a CUDA kernel for maintaining a priority queue
