Question

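I'm evaluating CUDA and am currently using the Thrust library to sort numbers.

I'd like to create my own comparer for thrust::sort, but it slows the sort down drastically! I created my own less implementation by simply copying the code from functional.h. However, it seems to be compiled in some other way and runs very slowly.

Default comparer: thrust::less<int>() - 94 ms
My own comparer: less<int>() - 906 ms

I'm using Visual Studio 2010. What should I do to get the same performance as with option 1?

Complete code: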

#include <stdio.h>
#include <stdlib.h>   // rand, srand
#include <time.h>     // time, clock

#include <cuda.h>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>

int myRand()
{
        static int counter = 0;
        if ( counter++ % 10000 == 0 )
                srand(time(NULL)+counter);
        return (rand()<<16) | rand();
}

template<typename T>
struct less : public thrust::binary_function<T,T,bool>
{
  __host__ __device__ bool operator()(const T &lhs, const T &rhs) const {
     return lhs < rhs;
  }
}; 

int main()
{
    thrust::host_vector<int> h_vec(10 * 1000 * 1000);
    thrust::generate(h_vec.begin(), h_vec.end(), myRand);

    thrust::device_vector<int> d_vec = h_vec;

    clock_t clc = clock();
    thrust::sort(d_vec.begin(), d_vec.end(), less<int>());
    printf("%dms\n", (int)((clock() - clc) * 1000 / CLOCKS_PER_SEC));

    return 0;
}

Answer 1

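The reason you're observing a difference in performance is that Thrust implements the sort with different algorithms depending on the arguments provided to thrust::sort.

In Case 1, Thrust can prove that the sort can be implemented in linear time with a radix sort. This is because the type of the data being sorted is a built-in numeric type (int) and the comparison function is the built-in less-than operation: Thrust recognizes that thrust::less<int> will produce the same result as x < y.

In Case 2, Thrust knows nothing about your user-provided less<int>, and has to use a more conservative algorithm based on a comparison sort, which has a different asymptotic complexity, even though your less<int> is in fact equivalent to thrust::less<int>.

In general, user-defined comparison operators can't be used with the more restrictive, faster sorts that manipulate the binary representation of the data, such as radix sort. In these cases, Thrust falls back on a more general, but slower, sort.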
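
To make the dispatch idea concrete, here is a minimal host-side C++ sketch. It is an analogy, not Thrust's actual implementation: std::less stands in for thrust::less, and the names SortPath, radix_sort, is_builtin_less, my_less, and sort_dispatch are all hypothetical. It shows how a library can select a radix sort only when the comparator's type is one it recognizes, and otherwise fall back to a comparison sort, even if the user's functor computes exactly the same thing.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <type_traits>
#include <vector>

// Hypothetical labels for the two code paths.
enum class SortPath { Radix, Comparison };

// LSD radix sort on 32-bit signed ints: four passes over 8-bit digits, O(n) work.
// It never calls a comparator; it only inspects the bits of each key.
inline void radix_sort(std::vector<std::int32_t>& v) {
    std::vector<std::int32_t> tmp(v.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::size_t count[257] = {0};
        for (std::int32_t x : v) {
            // Flip the sign bit so negative values order before non-negative ones.
            std::uint32_t key = static_cast<std::uint32_t>(x) ^ 0x80000000u;
            ++count[((key >> shift) & 0xFFu) + 1];
        }
        // Prefix sums turn per-digit counts into bucket start offsets.
        for (int i = 0; i < 256; ++i) count[i + 1] += count[i];
        for (std::int32_t x : v) {
            std::uint32_t key = static_cast<std::uint32_t>(x) ^ 0x80000000u;
            tmp[count[(key >> shift) & 0xFFu]++] = x;  // stable scatter
        }
        v.swap(tmp);
    }
}

// Trait: true only for comparator types known to mean plain `a < b`.
template <typename Compare>
struct is_builtin_less : std::false_type {};

template <>
struct is_builtin_less<std::less<std::int32_t>> : std::true_type {};

// A user-written comparator, analogous to the custom `less` in the question.
// It behaves like std::less, but the trait above cannot know that.
struct my_less {
    bool operator()(std::int32_t a, std::int32_t b) const { return a < b; }
};

// Picks the algorithm from the comparator's type alone.
template <typename Compare>
SortPath sort_dispatch(std::vector<std::int32_t>& v, Compare comp) {
    if (is_builtin_less<Compare>::value) {
        radix_sort(v);  // fast path: comparator is never invoked
        return SortPath::Radix;
    }
    std::sort(v.begin(), v.end(), comp);  // general O(n log n) fallback
    return SortPath::Comparison;
}
```

Calling sort_dispatch(v, std::less<std::int32_t>()) takes the radix path, while sort_dispatch(v, my_less()) takes the comparison path, although both produce the same ordering. In the question's code the corresponding fix is to pass thrust::less<int>() (or call thrust::sort without a comparator) whenever the intended ordering is the ordinary x < y.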

Fast CUDA Thrust custom comparison operator
