Question

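Due to some performance issues with the Thrust library (see this page for details), I am planning to refactor a CUDA application to use CUB instead of Thrust, specifically to replace the thrust::sort_by_key and thrust::inclusive_scan calls. At one point in my application I need to sort 3 arrays by key. This is how I do it with Thrust: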

```
thrust::sort_by_key(key_iter, key_iter + numKeys, indices);
thrust::gather_wrapper(indices, indices + numKeys,
      thrust::make_zip_iterator(thrust::make_tuple(values1Ptr, values2Ptr, values3Ptr)),
      thrust::make_zip_iterator(thrust::make_tuple(valuesOut1Ptr, valuesOut2Ptr, valuesOut3Ptr))
);
```

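where

key_iter is a thrust::device_ptr that points to the keys I want to sort by
indices points to a sequence (from 0 to numKeys-1) in device memory
values{1,2,3}Ptr are the values I want to sort
valuesOut{1,2,3}Ptr are the sorted values

With the CUB SortPairs function I can sort a single value buffer, but not all 3 buffers in one shot. The problem is that I don't see any CUB "gather-like" utilities. Suggestions?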

Edit:

I suppose I could implement my own gather kernel, but is there a better way to do this than:

```
template <typename Index, typename Value>
__global__ void gather_kernel(const unsigned int N, const Index * map,
                              const Value * src, Value * dst)
{
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
    {
        dst[i] = src[map[i]];
    }
}
```

The non-coalesced loads and stores make me chafe, but they are probably unavoidable without a known structure on map.
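For the three-array case described above, a hand-written gather could at least read each map entry once and reuse it for all three arrays. The following is only a minimal sketch under that assumption: the name gather3_kernel and the test data are hypothetical, and unified memory is used just to keep the host code short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical fused variant of gather_kernel: one map read serves three arrays.
template <typename Index, typename Value>
__global__ void gather3_kernel(const unsigned int N, const Index* map,
                               const Value* src1, const Value* src2, const Value* src3,
                               Value* dst1, Value* dst2, Value* dst3)
{
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
    {
        const Index j = map[i];  // consecutive threads read consecutive map entries
        dst1[i] = src1[j];       // source reads are scattered, destination writes are contiguous
        dst2[i] = src2[j];
        dst3[i] = src3[j];
    }
}

int main()
{
    const unsigned int N = 4;
    int* map = nullptr;
    float* src[3];
    float* dst[3];

    // Unified memory keeps the sketch short; explicit copies would work as well.
    cudaMallocManaged(&map, N * sizeof(int));
    for (int a = 0; a < 3; ++a) {
        cudaMallocManaged(&src[a], N * sizeof(float));
        cudaMallocManaged(&dst[a], N * sizeof(float));
    }

    // Illustrative data: the map reverses the element order.
    for (unsigned int i = 0; i < N; ++i) {
        map[i] = static_cast<int>(N - 1 - i);
        for (int a = 0; a < 3; ++a) src[a][i] = 10.0f * a + i;
    }

    const unsigned int threads = 256;
    const unsigned int blocks = (N + threads - 1) / threads;
    gather3_kernel<<<blocks, threads>>>(N, map, src[0], src[1], src[2],
                                        dst[0], dst[1], dst[2]);
    cudaDeviceSynchronize();

    for (int a = 0; a < 3; ++a) {
        for (unsigned int i = 0; i < N; ++i) printf("%g ", dst[a][i]);
        printf("\n");
    }

    cudaFree(map);
    for (int a = 0; a < 3; ++a) { cudaFree(src[a]); cudaFree(dst[a]); }
    return 0;
}
```

Note that in this access pattern only the reads from the source arrays are scattered; the reads of map and the writes to the destination arrays are contiguous across threads.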

Answer 1

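What you want to achieve seems to depend on thrust::zip_iterator. You could either

replace only thrust::sort_by_key with cub::DeviceRadixSort::SortPairs and keep thrust::gather, or

zip values{1,2,3} into an array of structures before using cub::DeviceRadixSort::SortPairs.

Update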
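The first option can be sketched roughly as follows: CUB sorts (key, index) pairs, and Thrust then gathers all three value arrays through a zip iterator. This is a minimal sketch, not the asker's actual code; the data, the variable names, and the use of thrust::device_vector for storage are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/sequence.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

int main()
{
    const int n = 4;
    const unsigned int h_keys[n] = {3, 1, 2, 0};   // illustrative keys
    const float h_vals[n] = {30.f, 10.f, 20.f, 0.f};

    thrust::device_vector<unsigned int> keys(h_keys, h_keys + n), keys_sorted(n);
    thrust::device_vector<float> v1(h_vals, h_vals + n), v2(v1), v3(v1);
    thrust::device_vector<float> o1(n), o2(n), o3(n);
    thrust::device_vector<int> idx(n), idx_sorted(n);
    thrust::sequence(idx.begin(), idx.end());      // 0, 1, ..., n-1

    unsigned int* d_keys_in  = thrust::raw_pointer_cast(keys.data());
    unsigned int* d_keys_out = thrust::raw_pointer_cast(keys_sorted.data());
    int* d_idx_in  = thrust::raw_pointer_cast(idx.data());
    int* d_idx_out = thrust::raw_pointer_cast(idx_sorted.data());

    // CUB's usual two-phase pattern: the first call only queries the
    // temporary storage size, the second call performs the sort.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out, d_idx_in, d_idx_out, n);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out, d_idx_in, d_idx_out, n);

    // Gather all three value arrays at once through the sorted indices.
    thrust::gather(idx_sorted.begin(), idx_sorted.end(),
                   thrust::make_zip_iterator(thrust::make_tuple(v1.begin(), v2.begin(), v3.begin())),
                   thrust::make_zip_iterator(thrust::make_tuple(o1.begin(), o2.begin(), o3.begin())));

    for (int i = 0; i < n; ++i) printf("%g ", static_cast<float>(o1[i]));
    printf("\n");

    cudaFree(d_temp);
    return 0;
}
```

In a real application the temporary storage would normally be allocated once and reused across sorts rather than allocated and freed on every call.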

After reading the implementation of thrust::gather in
```
$CUDA_HOME/include/thrust/system/detail/generic/gather.inl
```
you can see that it is only a naive kernel like
```
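// Naive gather: each thread copies the one element selected by the index map.
__global__ void gather(int* index, float* in, float* out, int len) {
  int i = ...;  // global thread index (elided in the original)
  if (i < len) { out[i] = in[index[i]]; }
}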
```
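So I think your code above can be replaced by a single kernel without too much effort.

In this kernel, you could first use the CUB block-wide primitive cub::BlockRadixSort<...>::SortBlockedToStriped to get the sorted indices stored in registers, and then perform a naive reordering copy, as thrust::gather does, to fill values{1,2,3}Out.

Using SortBlockedToStriped rather than Sort makes the writes coalesced when copying to values{1,2,3}Out (the reads are still not coalesced).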
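A rough sketch of that idea is shown below. It rests on a strong assumption that the answer does not address: all keys must fit in a single thread block's tile (BLOCK_THREADS * ITEMS_PER_THREAD items), because cub::BlockRadixSort only sorts within one block. The kernel name, the tile sizes, and the test data are hypothetical.

```cuda
#include <cstdio>
#include <climits>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

constexpr int BLOCK_THREADS = 128;
constexpr int ITEMS_PER_THREAD = 4;

// Assumes n <= BLOCK_THREADS * ITEMS_PER_THREAD and a launch with exactly one block.
__global__ void block_sort_gather(const unsigned int* keys, int n,
                                  const float* in1, const float* in2, const float* in3,
                                  float* out1, float* out2, float* out3)
{
    using BlockSort = cub::BlockRadixSort<unsigned int, BLOCK_THREADS, ITEMS_PER_THREAD, int>;
    __shared__ typename BlockSort::TempStorage temp;

    unsigned int k[ITEMS_PER_THREAD];
    int idx[ITEMS_PER_THREAD];

    // Blocked arrangement: thread t owns items t*ITEMS_PER_THREAD + j.
    for (int j = 0; j < ITEMS_PER_THREAD; ++j) {
        const int i = threadIdx.x * ITEMS_PER_THREAD + j;
        k[j] = (i < n) ? keys[i] : UINT_MAX;  // padding sorts to the end of the tile
        idx[j] = i;
    }

    // After the sort, thread t holds the items of rank j*BLOCK_THREADS + t.
    BlockSort(temp).SortBlockedToStriped(k, idx);

    for (int j = 0; j < ITEMS_PER_THREAD; ++j) {
        const int rank = j * BLOCK_THREADS + threadIdx.x;
        if (rank < n) {
            const int s = idx[j];   // source position: scattered read
            out1[rank] = in1[s];    // consecutive threads write consecutive ranks
            out2[rank] = in2[s];
            out3[rank] = in3[s];
        }
    }
}

int main()
{
    const int n = 8;
    const unsigned int h_keys[n] = {5, 2, 7, 0, 6, 1, 4, 3};  // illustrative keys
    unsigned int* keys = nullptr;
    float* in[3];
    float* out[3];

    cudaMallocManaged(&keys, n * sizeof(unsigned int));
    for (int a = 0; a < 3; ++a) {
        cudaMallocManaged(&in[a], n * sizeof(float));
        cudaMallocManaged(&out[a], n * sizeof(float));
    }
    for (int i = 0; i < n; ++i) {
        keys[i] = h_keys[i];
        for (int a = 0; a < 3; ++a) in[a][i] = 100.0f * a + h_keys[i];
    }

    block_sort_gather<<<1, BLOCK_THREADS>>>(keys, n, in[0], in[1], in[2],
                                            out[0], out[1], out[2]);
    cudaDeviceSynchronize();

    for (int a = 0; a < 3; ++a) {
        for (int i = 0; i < n; ++i) printf("%g ", out[a][i]);
        printf("\n");
    }

    cudaFree(keys);
    for (int a = 0; a < 3; ++a) { cudaFree(in[a]); cudaFree(out[a]); }
    return 0;
}
```

For inputs larger than one tile, a device-wide sort such as cub::DeviceRadixSort::SortPairs on the indices, followed by a separate gather, is likely the simpler route.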
