Question

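I'm writing a program in CUDA, and the problem is the following:

- There are two matrices, A (n * 128) and B (m * 128).
- I take the first row of A and compute the distance between that vector and every row of B, one by one.
- I write each distance into a row of a matrix C, so the element C(i, j) holds the distance between row i of A and row j of B.
- Then I move on to the next row of A.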

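I've implemented it this way: I have a grid of (n * m) blocks, with 128 threads per block (1 * 128).

The program compiles, but the problem is that it doesn't give the correct distances. I can't figure out what's wrong...

PS: I have CUDA 6.0 with an NVIDIA GTX 650 (compute capability 3.0)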

__global__ void EuclidianDistances( float *A, float *B , float *C , int n , int m)
{
    // SIZE is equal to 128
    __shared__ float accumResult[SIZE];
    __shared__ float sA[SIZE];
    __shared__ float sB[SIZE];

    // MAPPING
    int bx = blockIdx.x;  // n
    int by = blockIdx.y;  // m
    int ty = threadIdx.y; // 128
    int tx = threadIdx.x; // 1

    sA[ty] = A [bx * SIZE + ty];
    sB[ty] = B [by * SIZE + ty];
    __syncthreads();

    accumResult[ty] = (sA[ty] - sB[ty])*(sA[ty] - sB[ty]);
    __syncthreads();

    // Parallel tree-reduction
    for (int stride = SIZE/2 ; stride < 0 ; stride >>= 1)
        if (ty < stride)
        {
            accumResult[ty] += accumResult [stride + ty];
            __syncthreads();
        }

    // Writing results to output matrix
    if ((threadIdx.y == 0))
        C [bx * m + by] = accumResult[ty];
    __syncthreads();
}

Answer 1

The condition looks wrong:

for (int stride = SIZE/2 ; stride < 0 ; stride >>= 1)

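Assuming SIZE is 128, as you said, this loop will never be executed. Also, the __syncthreads() inside the if statement might stall the whole thing, since not every thread in the block reaches it.

Edit: after reading the OP's comment, I realized this is a language issue. Here is a snippet: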

#include <iostream>
using namespace std;

int main() {

    int SIZE = 128;

    for (int stride = SIZE/2 ; stride < 0 ; stride >>= 1)
        cout << "Hello I'm running" << endl;



    return 0;
}

http://ideone.com/AyhXYF

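The output is: nothing. Take a look at the for loop syntax in C++: the second part is the condition that must hold for the loop to keep running. Here stride starts at 64, so stride < 0 is false from the start. If you begin with a false condition, your loop is never executed.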

Computing the Euclidean distance between 2 matrices in CUDA
