Question

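I'm new to GPU programming. Recently I've been trying to implement a GPU BVH construction algorithm based on the tutorial http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-iii-tree-construction-gpu/. In the first step of the algorithm, a Morton code (unsigned int) is computed for each primitive and the codes are then sorted. The tutorial gives reference timings for Morton code computation and sorting with 12K objects: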

0.02 ms, one thread per object: Calculate bounding box and assign Morton code.
0.18 ms, parallel radix sort: Sort the objects according to their Morton codes. ...

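In my implementation, the first step takes 0.1 ms and the sorting step takes 1.8 ms. I'm using Thrust for the sort. So what is the fastest radix sort implementation on the GPU?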

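I'm using a GeForce Titan GPU, which should be faster than the GeForce GTX 690 used by the tutorial's author. Here is my sort test code; even with a size of 10, it takes about 1.5 ms.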

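#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_free.h>
#include <thrust/device_malloc.h>
#include <thrust/device_ptr.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>

void testSort()
{
    int sz = 10;
    // Fill the host-side keys with random values.
    thrust::host_vector<unsigned int> h_keys(sz);
    for(int i=0; i<sz; i++)
        h_keys[i] = rand();
    // Allocate device storage and copy the keys to the GPU.
    thrust::device_ptr<unsigned int> keys = thrust::device_malloc<unsigned int>(sz);
    thrust::copy(h_keys.begin(),h_keys.end(),keys);
    // Time only the sort call, using CUDA events.
    cudaEvent_t estart, estop;
    cudaEventCreate( &estart );
    cudaEventCreate( &estop );
    cudaEventRecord( estart, 0 );
    thrust::stable_sort(keys,keys+sz);
    cudaEventRecord( estop, 0 ) ;
    cudaEventSynchronize( estop );
    float elapsedTime;
    cudaEventElapsedTime( &elapsedTime,
        estart, estop ) ;
    printf( "Time to sort: %3.1f ms\n", elapsedTime );
    cudaEventDestroy( estart ) ;
    cudaEventDestroy( estop ) ;
    thrust::device_free(keys);
}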

Answer 1

back40computing provides a radix sort implementation for GPGPU. They publish a performance comparison chart and claim that their implementation is the fastest.
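
For readers who want to see what these libraries are parallelizing, here is a minimal CPU-side sketch of a least-significant-digit radix sort on 32-bit keys such as Morton codes. It is not back40computing's or Thrust's implementation; the function name `radixSortLSD` and the choice of 8 bits per pass are assumptions made for illustration. A GPU version performs the same three phases (histogram, prefix sum, scatter) in parallel across many threads.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative LSD radix sort for 32-bit keys (hypothetical helper,
// not taken from any GPU library). Uses 8 bits per pass, 4 passes.
void radixSortLSD(std::vector<std::uint32_t>& keys)
{
    const int kBitsPerPass = 8;
    const std::size_t kBuckets = 256;  // 2^kBitsPerPass
    std::vector<std::uint32_t> scratch(keys.size());

    for (int shift = 0; shift < 32; shift += kBitsPerPass) {
        // 1. Histogram: count the keys that fall into each digit bucket.
        std::array<std::size_t, 256> count{};
        for (std::uint32_t k : keys)
            ++count[(k >> shift) & (kBuckets - 1)];

        // 2. Exclusive prefix sum: turn counts into starting offsets.
        std::size_t offset = 0;
        for (std::size_t b = 0; b < kBuckets; ++b) {
            std::size_t c = count[b];
            count[b] = offset;
            offset += c;
        }

        // 3. Scatter: write each key to the next free slot of its bucket.
        //    Scanning in input order keeps the pass stable.
        for (std::uint32_t k : keys)
            scratch[count[(k >> shift) & (kBuckets - 1)]++] = k;

        // The sorted-by-this-digit output becomes the next pass's input.
        keys.swap(scratch);
    }
}
```

The work is proportional to the number of keys times the number of passes, with no comparisons between keys, which is why radix sort is the usual choice for fixed-width integer keys on the GPU.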
