Question

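I am using C++ and CUDA/Thrust to perform a calculation on the GPU, which is a new field for me. Unfortunately, my code (the MCVE below) is not very efficient, so I would like to know how to optimize it. The code performs the following steps:

There are two key vectors and two value vectors. The key vectors basically contain the i and j indices of an upper triangular matrix (in this example, of size 4x4).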

key1 {0, 0, 0, 1, 1, 2} value1: {0.5, 0.5, 0.5, -1.0, -1.0, 2.0}
key2 {1, 2, 3, 2, 3, 3} value2: {-1, 2.0, -3.5, 2.0, -3.5, -3.5}

The task is to sum all values that have the same key. To do that, I sorted the second key/value vector pair using sort_by_key. The result is:

key1 {0, 0, 0, 1, 1, 2} value1: {0.5, 0.5, 0.5, -1.0, -1.0, 2.0}
key2 {1, 2, 2, 3, 3, 3} value2: {-1.0, 2.0, 2.0, -3.5, -3.5, -3.5}

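After that, I merged the two key/value vector pairs using merge_by_key, which produces a new key vector and a new value vector, each twice as long as before.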

key_merge {0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3}
value_merge {0.5, 0.5, 0.5, -1.0, -1.0, -1.0, 2.0, 2.0, 2.0, -3.5, -3.5, -3.5}

The last step is to use reduce_by_key to sum all values that have the same key. The result is:

key {0, 1, 2, 3} value: {1.5, -3.0, 6.0, -10.5}

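The code below, which performs these operations, is quite slow, and I am afraid that performance will be poor for larger sizes. How can it be optimized? Is it possible to fuse sort_by_key, merge_by_key and reduce_by_key? Since I know the key vector that sort_by_key produces in advance, is it possible to transform the value vector "from the old key to the new key"? Does it make sense to merge the two vectors before reducing them, or is it faster to use reduce_by_key separately on each key/value vector pair? Is it possible to speed up the reduce_by_key calculation by using the fact that the number of distinct key values is known here and that the number of equal keys is always the same?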

#include <stdio.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/merge.h>

int main(){
   int key_1[6] = {0, 0, 0, 1, 1, 2};
   int key_2[6] = {1, 2, 3, 2, 3, 3};
   thrust::device_vector<double> k1(key_1,key_1+6);
   thrust::device_vector<double> k2(key_2,key_2+6);

   double value_1[6] = {0.5, 0.5, 0.5, -1.0, -1.0, 2.0};
   double value_2[6] = {-1, 2.0, -3.5, 2.0, -3.5, -3.5};
   thrust::device_vector<double> v1(value_1,value_1+6);
   thrust::device_vector<double> v2(value_2,value_2+6);

   thrust::device_vector<double> mk(12);
   thrust::device_vector<double> mv(12);
   thrust::device_vector<double> rk(4);
   thrust::device_vector<double> rv(4);

   thrust::sort_by_key(k2.begin(), k2.end(), v2.begin());
   thrust::merge_by_key(k1.begin(), k1.end(), k2.begin(), k2.end(),v1.begin(), v2.begin(), mk.begin(), mv.begin());
   thrust::reduce_by_key(mk.begin(), mk.end(), mv.begin(), rk.begin(), rv.begin());

   for (unsigned i=0; i<4; i++) {
     double tmp1 = rk[i];
     double tmp2 = rv[i];
     printf("key value %f is related to %f\n", tmp1, tmp2);
   }
   return 0;
}

Result:

key value 0.000000 is related to 1.500000
key value 1.000000 is related to -3.000000
key value 2.000000 is related to 6.000000
key value 3.000000 is related to -10.500000

Answer 1

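Here is one possible approach that I think might be quicker than your sequence. The key idea is that we want to avoid sorting data where we know the order ahead of time. If we can leverage the order knowledge that we have, then instead of sorting the data we can simply reorder it into the desired arrangement.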

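Let's make some observations about the data. If your key1 and key2 are in fact the i,j indices of an upper triangular matrix, then we can make some observations about the concatenation of these two vectors:

The concatenated vector will contain equal numbers of each key. (I believe you may have pointed this out in your question.) So in your case, the vector will contain three 0 keys, three 1 keys, three 2 keys, and three 3 keys. I believe this pattern should hold for any upper triangular pattern, independent of matrix dimension. So an upper triangular matrix of dimension N will have N sets of keys in the concatenated index vector, each set consisting of N-1 like elements.
In the concatenated vector, we can discover/establish a consistent ordering of keys (based on matrix dimension N), which allows us to reorder the vector in like-key-grouped order without resorting to a traditional sort operation.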
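The first observation can be checked on the host with a few lines of plain C++. This is only a sketch: the helper names concatenated_keys and key_counts are made up for illustration, and the code assumes the keys enumerate the strict upper triangle (i < j) row by row, as in the question's example.

```cpp
#include <cassert>
#include <vector>

// Build the concatenation key1|key2 for the strict upper triangle of an
// n x n matrix, enumerated row by row (i < j), as in the question's example.
// (Hypothetical helper, not part of Thrust.)
std::vector<int> concatenated_keys(int n) {
    assert(n >= 2);
    std::vector<int> key1, key2;
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            key1.push_back(i);  // row index
            key2.push_back(j);  // column index
        }
    }
    key1.insert(key1.end(), key2.begin(), key2.end());
    return key1;
}

// Count how often each key 0..n-1 occurs in the concatenated vector.
std::vector<int> key_counts(int n) {
    std::vector<int> counts(n, 0);
    for (int k : concatenated_keys(n)) ++counts[k];
    return counts;
}
```

Key k appears n-1-k times as a row index and k times as a column index, so every entry of key_counts(n) is n-1.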

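If we combine the above two ideas, then we can probably solve the entire problem with some scatter operations that replace the sort/merge activity, followed by the thrust::reduce_by_key operation. The scatter operations can be accomplished with thrust::copy to an appropriate thrust::permutation_iterator, combined with an appropriate index calculation functor. Since we know exactly what the reordered concatenated key vector will look like (in your dimension-4 example: {0,0,0,1,1,1,2,2,2,3,3,3}), we need not perform the reordering explicitly on it. However, we must reorder the value vector using the same mapping. So let's develop the arithmetic for that mapping:

dimension (N=)4 example
vector index:          0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11
desired (group) order: 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3
concatenated keys:     0, 0, 0, 1, 1, 2, 1, 2, 3, 2, 3, 3
group start idx:       0, 0, 0, 3, 3, 6, 3, 6, 9, 6, 9, 9
group offset idx:      0, 1, 2, 0, 1, 0, 2, 1, 0, 2, 1, 2
destination idx:       0, 1, 2, 3, 4, 6, 5, 7, 9, 8,10,11

dimension (N=)5 example
vector index:          0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19
desired (group) order: 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4
concatenated keys:     0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4
group start idx:       0, 0, 0, 0, 4, 4, 4, 8, 8,12, 4, 8,12,16, 8,12,16,12,16,16
group offset idx:      0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 3, 2, 1, 0, 3, 2, 1, 3, 2, 3
destination idx:       0, 1, 2, 3, 4, 5, 6, 8, 9,12, 7,10,13,16,11,14,17,15,18,19

We can observe that in each case, the destination index (i.e. the location to which the selected key or value is moved to obtain the desired group order) is equal to the group start index plus the group offset index. The group start index is simply the key multiplied by (N-1). The group offset index is a pattern similar to an upper or lower triangular index pattern (in two different incarnations, one for each half of the concatenated vector). The concatenated keys are simply the concatenation of the key1 and key2 vectors (we will create this concatenation virtually using permutation_iterator). The desired group order is known a priori: it is simply a sequence of integer groups, with N groups consisting of N-1 elements each. It is equivalent to the sorted version of the concatenated key vector. Therefore we can compute the destination index directly in a functor.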
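The mapping described above (destination index = group start index + group offset index) can be sketched on the host in plain C++. The function names destination_index and destination_indices are hypothetical, and the sketch assumes the row-by-row upper triangular enumeration used in the question.

```cpp
#include <cassert>
#include <vector>

// Destination index for element `id` of the concatenated vector, given its
// (k1, k2) = (row, column) pair and the matrix dimension n.
// (Hypothetical helper illustrating the arithmetic; not a Thrust API.)
int destination_index(int k1, int k2, int id, int n) {
    assert(n >= 2 && 0 <= k1 && k1 < k2 && k2 < n);
    const int half = n * (n - 1) / 2;
    if (id < half) {
        // first half: the key is k1, the offset counts up along the row
        return k1 * (n - 1) + (k2 - k1 - 1);
    }
    // second half: the key is k2, the offset is (n-2) minus the row offset
    return k2 * (n - 1) + (n - 1 - (k2 - k1));
}

// Destination indices for the whole concatenated vector of dimension n.
std::vector<int> destination_indices(int n) {
    std::vector<int> k1, k2;
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            k1.push_back(i);
            k2.push_back(j);
        }
    }
    const int half = static_cast<int>(k1.size());
    std::vector<int> dest(2 * half);
    for (int id = 0; id < 2 * half; ++id) {
        // id % half reads key1/key2 twice, i.e. a "virtual" concatenation
        dest[id] = destination_index(k1[id % half], k2[id % half], id, n);
    }
    return dest;
}
```

For n = 4, destination_indices returns 0, 1, 2, 3, 4, 6, 5, 7, 9, 8, 10, 11. Each result is a permutation of 0..n(n-1)-1, so a scatter through it writes every output slot exactly once.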

To create the group offset index pattern, we can subtract your two key vectors (and subtract an additional 1):

key2:                  1, 2, 3, 2, 3, 3
key1:                  0, 0, 0, 1, 1, 2
key1+1:                1, 1, 1, 2, 2, 3
p1 = key2-(key1+1):    0, 1, 2, 0, 1, 0
p2 = (N-2)-p1:         2, 1, 0, 2, 1, 2
grp offset idx=p1|p2:  0, 1, 2, 0, 1, 0, 2, 1, 0, 2, 1, 2

Here is a fully worked example demonstrating the above concepts using your example data:

$ cat t1133.cu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/transform.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
#include <iterator>

// "triangular sort" index generator
struct idx_functor
{
  int n;
  idx_functor(int _n): n(_n) {};
  template <typename T>
  __host__ __device__
  int operator()(const T &t){
    int k1 = thrust::get<0>(t);
    int k2 = thrust::get<1>(t);
    int id = thrust::get<2>(t);
    int go,k;
    if (id < (n*(n-1))/2){ // first half
      go = k2-k1-1;
      k  = k1;
      }
    else { // second half
      go = n-k2+k1-1;
      k  = k2;
      }
    return k*(n-1)+go;
  }
};

const int N = 4;
using namespace thrust::placeholders;

int main(){
  // useful dimensions
  int d1 = N*(N-1);
  int d2 = d1/2;
  // initialize keys
  int key_1[] = {0, 0, 0, 1, 1, 2};
  int key_2[] = {1, 2, 3, 2, 3, 3};
  thrust::device_vector<int> k1(key_1, key_1+d2);
  thrust::device_vector<int> k2(key_2, key_2+d2);
  // initialize values
  double value_1[] = {0.5, 0.5, 0.5, -1.0, -1.0, 2.0};
  double value_2[] = {-1, 2.0, -3.5, 2.0, -3.5, -3.5};
  thrust::device_vector<double> v(d1);
  thrust::device_vector<double> vg(d1);
  thrust::copy_n(value_1, d2, v.begin());
  thrust::copy_n(value_2, d2, v.begin()+d2);
  // reorder (group) values by key
  thrust::copy(v.begin(), v.end(), thrust::make_permutation_iterator(vg.begin(), thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(k1.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1%d2)), thrust::make_permutation_iterator(k2.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1%d2)), thrust::counting_iterator<int>(0))), idx_functor(N))));
  // sum results
  thrust::device_vector<double> rv(N);
  thrust::device_vector<int> rk(N);
  thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1/(N-1)), thrust::make_transform_iterator(thrust::counting_iterator<int>(d1), _1/(N-1)), vg.begin(), rk.begin(), rv.begin());
  // print results
  std::cout << "Keys:" << std::endl;
  thrust::copy_n(rk.begin(), N, std::ostream_iterator<int>(std::cout, ", "));
  std::cout << std::endl << "Sums:" << std::endl;
  thrust::copy_n(rv.begin(), N, std::ostream_iterator<double>(std::cout, ", "));
  std::cout << std::endl;
  return 0;
}
$ nvcc -std=c++11 -o t1133 t1133.cu
$ ./t1133
Keys:
0, 1, 2, 3,
Sums:
1.5, -3, 6, -10.5,
$

The net effect is that your thrust::sort_by_key and thrust::merge_by_key operations have been replaced by a single thrust::copy operation, which should be more efficient.
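When experimenting with the GPU version, a small host-side reference can be handy for checking the sums. The following plain C++ sketch applies the same scatter-then-sum idea on the CPU. The function name sum_by_key_reference is hypothetical, and the code assumes value1 and value2 are ordered like key1 and key2 in the question (strict upper triangle, row by row).

```cpp
#include <cassert>
#include <vector>

// CPU reference: scatter both value vectors into key-grouped order, then sum
// each group of n-1 consecutive elements. (Hypothetical helper for checking
// results; it is not the Thrust implementation.)
std::vector<double> sum_by_key_reference(const std::vector<double>& value1,
                                         const std::vector<double>& value2,
                                         int n) {
    const int half = n * (n - 1) / 2;
    assert(n >= 2);
    assert(static_cast<int>(value1.size()) == half);
    assert(static_cast<int>(value2.size()) == half);

    std::vector<double> grouped(2 * half);
    int id = 0;
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j, ++id) {
            // first half: key is the row index i
            grouped[i * (n - 1) + (j - i - 1)] = value1[id];
            // second half: key is the column index j
            grouped[j * (n - 1) + (n - 1 - (j - i))] = value2[id];
        }
    }

    // every key owns exactly n-1 consecutive slots, so g / (n-1) is the key
    std::vector<double> sums(n, 0.0);
    for (int g = 0; g < 2 * half; ++g) sums[g / (n - 1)] += grouped[g];
    return sums;
}
```

With the question's data and n = 4, this returns 1.5, -3, 6, -10.5, matching the sums shown in the question.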

Cuda Thrust - How to optimize code using sort_by_key, merge_by_key and reduce_by_key