Question

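I am solving the minimal dominating set problem on CUDA. Every thread finds some local candidate result, and I need to find the best one among them. I am using __device__ variables for the global result (dev_bestConfig and dev_bestValue).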

I need to do something like this:

__device__ configType dev_bestConfig = 0;
__device__ int dev_bestValue = INT_MAX;

__device__ void findMinimalDominantSet(int count, const int *matrix, Lock &lock)
{
    // here is some algorithm that finds local bestValue and bestConfig

    // set device variables
    if (bestValue < dev_bestValue)
    {
        dev_bestValue = bestValue;
        dev_bestConfig = bestConfig;
    }
}

I know this doesn't work because multiple threads access the memory at the same time, so I use this critical section:

    // set device variables
    bool isSet = false;
    do
    {
        if (isSet = atomicCAS(lock.mutex, 0, 1) == 0)
        {
            // critical section goes here
            if (bestValue < dev_bestValue)
            {
                dev_bestValue = bestValue;
                dev_bestConfig = bestConfig;
            }
        }
        if (isSet)
        {
            *lock.mutex = 0;
        }
    } while (!isSet);

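This actually works as expected, but it is very slow. For example, the kernel takes 0.1 seconds without this critical section and 1.8 seconds with it.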

What can I do to make it faster?

Answer 1

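In the end I avoided critical sections and locking altogether. Each thread saves its local result to an array, and the best result is searched for afterwards. That search can be done sequentially or in parallel.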
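A minimal sketch of this approach, using the sequential variant of the final search. Several things here are assumptions, not taken from the question: configType is typedef'd as a 64-bit unsigned integer (the question doesn't show its definition), localSearch is a hypothetical placeholder for the real per-thread algorithm, and searchKernel and findBest are illustrative names. Error checking is omitted for brevity.

```cuda
#include <climits>
#include <vector>
#include <cuda_runtime.h>

// Assumption: the question does not show how configType is defined.
typedef unsigned long long configType;

// Hypothetical stand-in for the per-thread search from the question.
// It only derives an arbitrary (value, config) pair from the thread index.
__device__ void localSearch(int tid, int &bestValue, configType &bestConfig)
{
    bestConfig = (configType)tid;
    bestValue  = (tid * 37 + 11) % 101;  // placeholder cost, not a real algorithm
}

// Each thread writes its local result to its own slot, so no lock
// and no atomic operation is needed.
__global__ void searchKernel(int n, int *values, configType *configs)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int value;
    configType config;
    localSearch(tid, value, config);

    values[tid]  = value;
    configs[tid] = config;
}

// Launches the kernel, copies the per-thread results back and scans them
// sequentially on the host. Returns the best value and stores the matching
// configuration in *bestConfigOut.
int findBest(int n, configType *bestConfigOut)
{
    int *dValues = nullptr;
    configType *dConfigs = nullptr;
    cudaMalloc(&dValues, n * sizeof(int));
    cudaMalloc(&dConfigs, n * sizeof(configType));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    searchKernel<<<blocks, threads>>>(n, dValues, dConfigs);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<int> values(n);
    std::vector<configType> configs(n);
    cudaMemcpy(values.data(), dValues, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(configs.data(), dConfigs, n * sizeof(configType), cudaMemcpyDeviceToHost);
    cudaFree(dValues);
    cudaFree(dConfigs);

    // Sequential search for the best result; ties keep the lowest index.
    int bestValue = INT_MAX;
    configType bestConfig = 0;
    for (int i = 0; i < n; ++i) {
        if (values[i] < bestValue) {
            bestValue  = values[i];
            bestConfig = configs[i];
        }
    }
    *bestConfigOut = bestConfig;
    return bestValue;
}
```

The host-side loop costs one extra copy of n results, which is usually cheap compared with serializing every thread on a single mutex. If n is very large, the same scan could instead be done on the device with a parallel reduction.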
