Question

I have the following code as part of reorganizing data for later use in a CUDA kernel:

thrust::device_ptr<int> dev_ptr = thrust::device_pointer_cast(dev_particle_cell_indices);
int total = 0;
for(int i = 0; i < num_cells; i++) {
    particle_offsets[i] = total;
    // int num = 0;
    int num = thrust::count(dev_ptr, dev_ptr + num_particles, i);
    particle_counts[i] = num;
    total += num;
}

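Now, if I set num to 0 (uncomment line 5 and comment out line 6), the application runs at more than 30 fps, which is my goal. However, when I set num to the result of thrust::count, the frame rate drops to about 1-2 fps. Why does this happen?

My understanding is that Thrust is supposed to be a collection of highly optimized algorithms that harness the power of the GPU, so I am surprised that it has such an impact on my program's performance. This is my first time using Thrust, though, so I may be unaware of some important details.

Is there something about using thrust::count in a loop that makes it run so slowly? How can I optimize my usage of it?

To give some numbers: in my current test case, num_particles is about 2000 and num_cells is about 1500.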

Answer 1

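I have to be frank: this answer is going to be

A critique of Thrust
A plug for ArrayFire (of which I am a core developer)

Critique of Thrust

They do a great job of optimizing parallel algorithms for vector inputs. They use data-level parallelism (among other things) to parallelize algorithms that work very efficiently on large vector inputs. But they stop short of true data-level parallelism, that is, handling a large number of small problems.

That second case is useful in many real-world applications, and ArrayFire provides a solution for it (see gfor, the parallel for loop).

Plug for ArrayFire

Instead of four algorithms (one of which is an expensive sort) and three memory copies, this should be a simple call to a reduction and a scan.

Here is how the code would work in ArrayFire: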

array cell_indices(num_particles, 1, dev_particle_cell_indices, afDevicePointer);
array particle_counts = zeros(num_cells);

gfor(array i, num_cells) // Parallel for loop
        particle_counts(i) = sum(cell_indices == i);

array particle_offsets = accum(particle_counts); // Inclusive sum
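To make the "reduction and scan" formulation concrete, here is a minimal CPU-only sketch in standard C++. It is not ArrayFire's implementation, and the helper names count_per_cell and offsets_from_counts are hypothetical. It only shows what the two steps compute. Note that the loop in the question builds exclusive offsets (each offset excludes the cell's own count), so the sketch uses an exclusive running sum; whether the inclusive accum result needs shifting to match is worth checking against the ArrayFire documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): a single counting pass
// over the particles, binning each one into its cell.
std::vector<int> count_per_cell(const std::vector<int>& cell_indices, int num_cells) {
    assert(num_cells >= 0);
    std::vector<int> counts(num_cells, 0);
    for (int c : cell_indices) {
        assert(c >= 0 && c < num_cells);  // every index must name a valid cell
        ++counts[c];
    }
    return counts;
}

// Hypothetical helper: exclusive prefix sum, so that
// offsets[i] = counts[0] + ... + counts[i-1], as in the question's loop.
std::vector<int> offsets_from_counts(const std::vector<int>& counts) {
    std::vector<int> offsets(counts.size());
    int total = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        offsets[i] = total;
        total += counts[i];
    }
    return offsets;
}
```

For the cell indices {2, 0, 2, 3, 0, 2} and four cells, this yields counts {2, 0, 3, 1} and offsets {0, 2, 2, 5}.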

-

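Setup

I benchmarked talonmies' code against ArrayFire.
I am using a similar graphics card (GTS 360M) on 64-bit Linux (CUDA 4.1 / gcc 4.7).
You can find the full benchmark code here.

Benchmark 1

Using num_particles = 2000 and num_cells = 1500 (as in the original question):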

$ ./a.out 
Thrust time taken: 0.002384
ArrayFire time taken: 0.000131

ArrayFire is about 18 times faster.

Benchmark 2

Using num_particles = 10000 and num_cells = 2000 (as in talonmies' test case):

$ ./a.out 
Thrust time taken: 0.002920
ArrayFire time taken: 0.000132

ArrayFire is about 22 times faster.

Benchmark 3

Using num_particles = 50000 and num_cells = 5000 (just a larger test case):

$ ./a.out 
Thrust time taken: 0.003596
ArrayFire time taken: 0.000157

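ArrayFire is about 23 times faster.

Notes

Thrust requires you to rewrite your code.

Thrust provides a ~320x speedup over the original code.

ArrayFire requires almost no rewriting of the code (just a change to gfor).

ArrayFire is a further 18-23 times faster than Thrust (in effect, roughly 7300 times faster than the original code).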

ArrayFire scales better (from benchmark 1 to benchmark 3, Thrust's run time increases by about 50%, ArrayFire's by about 20%).

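Conclusion

If you can reformulate your problem, Thrust does provide a decent speedup. But that is not always feasible, and it is not trivial for more complex problems. The numbers suggest that there is room for much higher performance (because of the high degree of data parallelism), which Thrust simply does not exploit.

ArrayFire uses the parallel resources more efficiently, and the timings indicate that the GPU is still not saturated.

You may want to write your own custom CUDA code or use ArrayFire. I just wanted to point out that sometimes Thrust is not an option, because it is nearly useless on a large number of small problems.

EDIT: Fixed the results for benchmark 1 (I had used the wrong numbers).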

Answer 2

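The performance of thrust::count is just fine; it is the way you are trying to use it that is problematic for performance. If you had a lot of particles and only a few cells, then your implementation using thrust::count would probably not be a bad idea. Your problem is that you have 1500 cells. That means 1500 invocations of count and 1500 device to host memory transfers every time you want to do the computation. As you have found, the latency of all those kernel launches and all those PCI-e bus copies kills performance.

A better approach for a large number of cells would be something like this: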

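#include <thrust/adjacent_difference.h>
#include <thrust/binary_search.h>
#include <thrust/copy.h>
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/sort.h>

thrust::device_ptr<int> rawin = thrust::device_pointer_cast(dev_particle_cell_indices);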

// Sort a scratch copy of the cell indices by value
thrust::device_vector<int> cidx(num_particles);
thrust::copy(rawin, rawin+num_particles, cidx.begin());
thrust::sort(cidx.begin(), cidx.end());

// Use binary search to extract all the cell counts/offsets
thrust::counting_iterator<int> cellnumber(0);
thrust::device_vector<int> offsets(num_cells), counts(num_cells);

// Offsets come from lower_bound of the ordered cell numbers
thrust::lower_bound(cidx.begin(), cidx.end(), cellnumber, cellnumber+num_cells, offsets.begin());

// Counts come from the adjacent_difference of the upper_bound of the ordered cell numbers
thrust::upper_bound(cidx.begin(), cidx.end(), cellnumber, cellnumber+num_cells, counts.begin());
thrust::adjacent_difference(counts.begin(), counts.end(), counts.begin());

// Copy back to the host pointer
thrust::copy(counts.begin(), counts.end(), particle_counts);
thrust::copy(offsets.begin(), offsets.end(), particle_offsets);

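Here we first sort a local copy of the cell indices and then use the Thrust binary search functions to perform the same operation as your code, but with far fewer passes through the data in GPU memory and with only two device to host memory copies to get all the results back to the host.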
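For readers who want to check the logic without a GPU, the same sort-then-search idea can be sketched on the CPU with the standard library. This is only an illustration under stated assumptions: std::sort, std::lower_bound and std::upper_bound stand in for their Thrust counterparts, and the CellTable struct and build_cell_table function are hypothetical names, not part of the code above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical result type: per-cell offsets and counts.
struct CellTable {
    std::vector<int> offsets;  // position of each cell's first particle in sorted order
    std::vector<int> counts;   // number of particles in each cell
};

// CPU analogue of the sort + binary search approach.
CellTable build_cell_table(std::vector<int> cell_indices, int num_cells) {
    assert(num_cells >= 0);
    // The argument is taken by value, so this sorts a scratch copy and
    // leaves the caller's data untouched.
    std::sort(cell_indices.begin(), cell_indices.end());

    CellTable t{std::vector<int>(num_cells), std::vector<int>(num_cells)};
    for (int c = 0; c < num_cells; ++c) {
        // In sorted order, all particles of cell c occupy the range [lo, hi).
        auto lo = std::lower_bound(cell_indices.begin(), cell_indices.end(), c);
        auto hi = std::upper_bound(cell_indices.begin(), cell_indices.end(), c);
        t.offsets[c] = static_cast<int>(lo - cell_indices.begin());
        t.counts[c] = static_cast<int>(hi - lo);
    }
    return t;
}
```

For the cell indices {2, 0, 2, 3, 0, 2} and four cells, the sorted copy is {0, 0, 2, 2, 2, 3}, giving offsets {0, 2, 2, 5} and counts {2, 0, 3, 1}. These are the same values the loop in the question accumulates with total and thrust::count. The Thrust version differs in that it performs all searches in one vectorized call per bound and derives the counts with adjacent_difference.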

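When I benchmarked your thrust::count implementation against the code posted above for a non-trivial case (10000 random particles and 2000 cells on a GeForce 320M with CUDA 4.1 on OS X), I found that your version takes about 0.95 seconds to run, whereas the sort/search version takes about 0.003 seconds. So there is probably a speedup of several hundred times available from Thrust if you use a more efficient strategy and more suitable algorithms.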
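As a rough back-of-the-envelope reading of those two timings, assuming the loop time is spread evenly over the 2000 thrust::count calls (one per cell, each returning its result to the host):

```latex
\[
\frac{0.95\ \text{s}}{2000\ \text{calls}} \approx 4.75 \times 10^{-4}\ \text{s} \approx 0.5\ \text{ms per call},
\qquad
\frac{0.95\ \text{s}}{0.003\ \text{s}} \approx 317 .
\]
```

About half a millisecond per call is the kind of fixed cost that launch and transfer latency would explain, and the ratio of roughly 317 is the "several hundred times" speedup quoted above.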
