Question

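I have several blocks, each operating on a different part of an integer array. For example: block one works on array[0] through array[9], and block two on array[10] through array[20].

What is the best way to get the index of the maximum value of the array for each block?

Example: block one, a[0] to a[10], has the following values: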
5 10 2 3 4 34 56 3 9 10

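So 56 is the maximum value, at index 6.

I cannot use shared memory because the array may be very large, so it would not fit. Is there any library that lets me do this quickly?

I know about the reduction algorithm, but I think my case is different because I want the index of the maximum element.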

Answer 1

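If I understand correctly, what you want is the index in array A at which its maximum value occurs.

If so, I suggest you use the Thrust library:

Here is how you would do it: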

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/tuple.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <iostream>

using namespace thrust;

// Return the larger of two (value, index) tuples. Tuples compare
// lexicographically, so equal values are ordered by index.
template <class T>
struct bigger_tuple {
    __device__ __host__
    tuple<T,int> operator()(const tuple<T,int> &a, const tuple<T,int> &b) const
    {
        if (a > b) return a;
        else return b;
    }
};

// Return the index of the largest element of vec (vec must not be empty).
template <class T>
int max_index(device_vector<T>& vec) {

    // create implicit index sequence [0, 1, 2, ... )
    counting_iterator<int> begin(0); counting_iterator<int> end(vec.size());
    tuple<T,int> init(vec[0],0);
    tuple<T,int> largest;

    // reduce over (value, index) pairs, keeping the bigger pair at each step
    largest = reduce(make_zip_iterator(make_tuple(vec.begin(), begin)), make_zip_iterator(make_tuple(vec.end(), end)),
                     init, bigger_tuple<T>());
    return get<1>(largest);
}

int main(){

    thrust::host_vector<int> h_vec(1024);
    thrust::sequence(h_vec.begin(), h_vec.end()); // values = indices

    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;

    int index = max_index(d_vec);

    std::cout <<  "Max index is:" << index <<std::endl;
    std::cout << "Value is: " << h_vec[index] <<std::endl;

    return 0;
}

Answer 2

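This won't help the original poster, but for those who land on this page looking for an answer, I would recommend Thrust, which already provides thrust::max_element for exactly this purpose. It returns an iterator to the largest element, and subtracting the begin iterator from it gives the index. min_element and minmax_element are provided as well. See the Thrust documentation for details.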
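
As a rough illustration of the iterator-to-index step, here is a host-side sketch using std::max_element from the C++ standard library, which has the same call shape (a range in, an iterator out). The helper name argmax_in_range is hypothetical and not part of Thrust. On the GPU the same pattern would presumably be applied with thrust::max_element over device iterators, one call per sub-range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: index, relative to the whole array, of the maximum
// element in the half-open range [first, last).
std::size_t argmax_in_range(const std::vector<int>& data,
                            std::size_t first, std::size_t last) {
    assert(first < last && last <= data.size());
    auto range_begin = data.begin() + static_cast<std::ptrdiff_t>(first);
    auto range_end   = data.begin() + static_cast<std::ptrdiff_t>(last);
    // max_element returns an iterator, not an index.
    auto it = std::max_element(range_begin, range_end);
    // Distance from the start of the array converts the iterator to an index.
    return static_cast<std::size_t>(std::distance(data.begin(), it));
}
```

With the values from the question stored at the start of a vector, argmax_in_range(data, 0, 10) yields 6, the position of 56.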

Answer 3

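Besides the suggestion to use Thrust, you can also use the cuBLAS function cublasIsamax. Note that it operates on single-precision floats, returns the index of the element with the largest absolute value, and uses 1-based indexing.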

Answer 4

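The size of the array is almost irrelevant as far as shared memory is concerned, because the number of threads in each block is the limiting factor, not the size of the array. One solution is to have each thread block work on a portion of the array the same size as the thread block. That is, if you have 512 threads, then block n looks at array[512 * n] through array[512 * n + 511]. Each block does a reduction to find the highest member in its portion of the array. Then you bring the maximum of each section back to the host and do a simple linear search to locate the highest value in the whole array. Each reduction on the GPU shrinks the linear search by a factor of 512. Depending on the size of the array, you may want to do more reductions before bringing the data back. (If your array has 3 * 512^10 elements, you may want to do 10 reductions on the GPU and have the host search through the 3 remaining data points.)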
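
The two-stage scheme described above can be sketched on the host as follows. This is a sequential C++ stand-in, not a CUDA kernel: the loop over chunks plays the role of the thread blocks, and the names MaxEntry, chunk_maxima and global_max are hypothetical. Each partial result carries its index along with its value, so the position of the maximum survives the second stage.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A maximum value together with its position in the full array.
struct MaxEntry {
    int value;
    std::size_t index;
};

// Stage 1: one (value, index) result per chunk. On the GPU, each chunk
// would be handled by one thread block (for example, 512 elements each).
std::vector<MaxEntry> chunk_maxima(const std::vector<int>& data, std::size_t chunk) {
    assert(chunk > 0);
    std::vector<MaxEntry> partial;
    for (std::size_t start = 0; start < data.size(); start += chunk) {
        std::size_t stop = std::min(start + chunk, data.size());
        MaxEntry best{data[start], start};
        for (std::size_t i = start + 1; i < stop; ++i) {
            if (data[i] > best.value) best = MaxEntry{data[i], i};
        }
        partial.push_back(best);
    }
    return partial;
}

// Stage 2: linear search over the per-chunk results (the host-side step).
MaxEntry global_max(const std::vector<MaxEntry>& partial) {
    assert(!partial.empty());
    MaxEntry best = partial[0];
    for (std::size_t k = 1; k < partial.size(); ++k) {
        if (partial[k].value > best.value) best = partial[k];
    }
    return best;
}
```

For a very large array, chunk_maxima could be applied again to the values of its own output before the final search, which corresponds to doing more than one reduction pass on the GPU.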

Answer 5

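One thing to watch out for when doing a max-value-plus-index reduction: if the array contains more than one element with the maximum value (in your example, two or more values equal to 56), the returned index is not unique and may differ from run to run, because the order in which threads execute on the GPU is not deterministic.

To get around this, you can use a unique ordering index such as threadid + threadsperblock * blockid, or the element's index position if that is unique. The max test then looks something like this: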

if (a > max_so_far || (a == max_so_far && order_a > order_max_so_far))
{ 
    max_so_far = a;
    index_max_so_far = index_a;
    order_max_so_far = order_a;
}

(The index and the order can be the same variable, depending on the application.)
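
To see why the extra order comparison makes the result repeatable, here is a small host-side C++ sketch. It is only an illustration under stated assumptions: Candidate, combine and reduce_in_order are hypothetical names, the element index doubles as the order, and the visiting order passed in stands in for the unpredictable thread schedule on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A value together with its unique ordering key (here, the element index).
struct Candidate {
    int value;
    std::size_t order;
};

// Same test as in the answer: a larger value wins, and on equal values
// the larger order wins, so the winner does not depend on argument order.
Candidate combine(const Candidate& a, const Candidate& b) {
    if (a.value > b.value || (a.value == b.value && a.order > b.order)) return a;
    return b;
}

// Fold the elements in the given visiting order, which simulates threads
// arriving at the reduction in an arbitrary sequence.
Candidate reduce_in_order(const std::vector<int>& data,
                          const std::vector<std::size_t>& visit) {
    assert(!visit.empty());
    Candidate best{data[visit[0]], visit[0]};
    for (std::size_t k = 1; k < visit.size(); ++k) {
        best = combine(Candidate{data[visit[k]], visit[k]}, best);
    }
    return best;
}
```

Whatever permutation of indices is passed as visit, the same Candidate comes back; dropping the order comparison from combine would let the result depend on which duplicate is seen first.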

CUDA: Get the maximum value and its index in an array

5 answers: