Question

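Say you have a function that takes a vector and a set of vectors, and finds which vector in the set is closest to the original vector. It may be useful if I include some code: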

#include <math.h>

int findBMU(float * inputVector, float * weights){

    int winner = 0;
    float leastDistance = 99999;

    // weights holds 10*10*10 = 1000 vectors of 644 floats each, stored back to back
    for(int i = 0; i<10; i++){
        for(int j = 0; j<10; j++){
            for(int k = 0; k<10; k++){

                int offset = (i*100+j*10+k)*644;
                float currentDistance = 0;
                // Euclidean distance between inputVector and the vector starting at offset
                for(int w = offset; w<offset+644; w++){
                    float diff = inputVector[w-offset]-weights[w];
                    currentDistance += diff*diff;
                }
                currentDistance = sqrt(currentDistance);

                if(currentDistance<leastDistance){
                    winner = offset;
                    leastDistance = currentDistance;
                }
            }
        }
    }
    return winner;
}

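In this example, weights is a one-dimensional array in which each block of 644 elements corresponds to one vector. inputVector is the vector being compared, and it also has 644 elements.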

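To speed up my program, I decided to take a look at the CUDA framework provided by NVIDIA. This is what my code looked like after I changed it to fit CUDA's specifications.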

__global__ void findBMU(float * inputVector, float * weights, int * winner, float * leastDistance){

    int i = threadIdx.x+(blockIdx.x*blockDim.x);

    if(i<1000){

        int offset = i*644;
        float currentDistance = 0;
        for(int w = offset; w<offset+644; w++){
            float diff = inputVector[w-offset]-weights[w];
            currentDistance += diff*diff;
        }
        currentDistance = sqrt(currentDistance);

        if(currentDistance<*leastDistance){
            *winner = offset;
            *leastDistance = currentDistance;
        }
    }

}

To call the function, I used: findBMU<<<20, 50>>>(d_data, d_weights, d_winner, d_least);

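However, when I call the function, sometimes it gives me the right answer and sometimes it doesn't. After doing some research, I found that CUDA has some issues with reduction problems like this, but I couldn't find out how to fix it. How can I modify my program so that it works with CUDA?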

Answer 1

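The problem is that threads running concurrently see the same leastDistance and overwrite each other's results. Two values are shared between threads: leastDistance and winner. Because the comparison and the two writes are not a single atomic step, a thread can pass the check against a stale leastDistance and then overwrite a better result. You have two basic options. You can write out the results from all threads and then make a second pass over the data with a parallel reduction to determine which vector is the best match, or you can implement this with a custom atomic operation built on atomicCAS().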

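The first method is the easiest. My guess is that it will also give you the best performance, although it does add a dependency on the free Thrust library. You would use thrust::min_element().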
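As a rough illustration of the two-pass idea, here is a minimal sketch. It assumes the 1000 x 644 layout from the question; the names computeDistances, findBMUTwoPass, NUM_VECTORS and VECTOR_LEN are made up for this example and are not part of Thrust or CUDA.

```cuda
#include <thrust/device_vector.h>
#include <thrust/extrema.h>

const int NUM_VECTORS = 1000;  // 10*10*10 weight vectors, as in the question
const int VECTOR_LEN  = 644;   // elements per vector, as in the question

// Pass 1: each thread writes the squared distance for its own vector.
// No value is shared between threads, so there is nothing to race on.
__global__ void computeDistances(const float *inputVector, const float *weights,
                                 float *distances)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < NUM_VECTORS) {
        const float *w = weights + i * VECTOR_LEN;
        float sum = 0.0f;
        for (int k = 0; k < VECTOR_LEN; k++) {
            float diff = inputVector[k] - w[k];
            sum += diff * diff;
        }
        // sqrt is skipped: it does not change which distance is smallest.
        distances[i] = sum;
    }
}

// Pass 2: Thrust finds the smallest distance on the device.
// Returns the offset into weights, like the original CPU findBMU.
// d_input and d_weights are assumed to be device pointers.
int findBMUTwoPass(const float *d_input, const float *d_weights)
{
    thrust::device_vector<float> distances(NUM_VECTORS);
    computeDistances<<<20, 50>>>(d_input, d_weights,
                                 thrust::raw_pointer_cast(distances.data()));
    thrust::device_vector<float>::iterator best =
        thrust::min_element(distances.begin(), distances.end());
    return static_cast<int>(best - distances.begin()) * VECTOR_LEN;
}
```

With the buffers from the question, the call would look like int winner = findBMUTwoPass(d_data, d_weights);. If the distance itself is needed, take the square root of *best on the host.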

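The atomicCAS() method relies on the fact that atomicCAS() has a 64-bit mode, in which you can assign whatever semantics you like to the 64-bit value. In your case, you would use 32 bits to store leastDistance and 32 bits to store winner. To use this method, adapt this example from the CUDA C Programming Guide, which implements a double-precision floating-point atomicAdd().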

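#if __CUDA_ARCH__ < 600
// Devices of compute capability 6.0 and higher provide atomicAdd() for double natively.
__device__ double atomicAdd(double* address, double val)
{
  unsigned long long int* address_as_ull =
    (unsigned long long int*)address;
  unsigned long long int old = *address_as_ull, assumed;
  do {
    assumed = old;
    // Try to replace the value we read with the updated one. atomicCAS returns
    // what was actually stored, so a mismatch means another thread got there first.
    old = atomicCAS(address_as_ull, assumed,
                    __double_as_longlong(val + __longlong_as_double(assumed)));
  // Integer comparison avoids a hang when the value is NaN (NaN != NaN).
  } while (assumed != old);
  return __longlong_as_double(old);
}
#endif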
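Adapting the atomicAdd() pattern above to this problem might look like the sketch below. It assumes the distances are non-negative and not NaN, so that a float's bit pattern sorts in the same order as its value. The distance goes in the upper 32 bits and the offset in the lower 32 bits. The names packDistanceIndex, atomicMinPacked, findBMUAtomic and findBMUWithCAS are invented for this illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: distance in the high 32 bits, index in the low 32 bits.
// Comparing packed values as unsigned integers compares the distance first.
__device__ unsigned long long packDistanceIndex(float distance, int index)
{
    return ((unsigned long long)__float_as_uint(distance) << 32) |
           (unsigned int)index;
}

// Hypothetical helper: atomically keep the smaller of *address and val.
__device__ void atomicMinPacked(unsigned long long *address, unsigned long long val)
{
    unsigned long long old = *address, assumed;
    do {
        assumed = old;
        if (assumed <= val) break;  // stored result is already at least as good
        old = atomicCAS(address, assumed, val);
    } while (assumed != old);       // another thread changed it, so retry
}

__global__ void findBMUAtomic(const float *inputVector, const float *weights,
                              unsigned long long *best)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < 1000) {
        int offset = i * 644;
        float sum = 0.0f;
        for (int k = 0; k < 644; k++) {
            float diff = inputVector[k] - weights[offset + k];
            sum += diff * diff;
        }
        // Distance and winner are updated together in one atomic step.
        atomicMinPacked(best, packDistanceIndex(sqrtf(sum), offset));
    }
}

// Host wrapper: d_input and d_weights are assumed to be device pointers.
int findBMUWithCAS(const float *d_input, const float *d_weights)
{
    unsigned long long *d_best;
    unsigned long long h_best;
    cudaMalloc(&d_best, sizeof(*d_best));
    cudaMemset(d_best, 0xFF, sizeof(*d_best));  // all bits set: worse than any result
    findBMUAtomic<<<20, 50>>>(d_input, d_weights, d_best);
    cudaMemcpy(&h_best, d_best, sizeof(h_best), cudaMemcpyDeviceToHost);
    cudaFree(d_best);
    return (int)(h_best & 0xFFFFFFFFull);       // low 32 bits hold the winning offset
}
```

On devices that support the 64-bit unsigned atomicMin(), the CAS loop in atomicMinPacked could probably be replaced by a single atomicMin() call on the packed value.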
