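  • Two matrices, A (n * 128) and B (m * 128).

  • I take the first row of A and compute, one at a time, the distance between that vector and every row of B.

  • I write each distance into a row of matrix C, so that element C(i,j) holds the distance between row i of A and row j of B.

  • Then I move on to the next row of A.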
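
As a point of comparison, the steps above can be sketched on the CPU. This is only an illustrative reference, not one of the kernels: the function name `pairwise_sq_dist` and the use of `std::vector` are assumptions made for this sketch, and it returns squared distances (no square root), matching what the kernels in this post compute.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: C[i*m + j] is the squared Euclidean distance
// between row i of A (n x dim) and row j of B (m x dim), all stored row-major.
std::vector<float> pairwise_sq_dist(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    std::size_t n, std::size_t m, std::size_t dim)
{
    std::vector<float> C(n * m, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {          // one row of A at a time
        for (std::size_t j = 0; j < m; ++j) {      // against every row of B
            float acc = 0.0f;
            for (std::size_t k = 0; k < dim; ++k) {
                float d = A[i * dim + k] - B[j * dim + k];
                acc += d * d;
            }
            C[i * m + j] = acc;                    // row i of C, column j
        }
    }
    return C;
}
```

A result buffer produced this way can serve as the baseline that a GPU result is checked against.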

I implemented it this way: I have a grid of (n * m) blocks, each with 128 threads (1 * 128).



    __global__ void EuclideanDistances1( float *A, float *B , float *C , int n , int m)
    {
        // SIZE is equal to 128
        __shared__ float accumResult[SIZE];
        float sA;
        float sB;

        // MAPPING
        int bx = blockIdx.x;  // n
        int by = blockIdx.y;  // m
        int ty = threadIdx.y; // 128
        int tx = threadIdx.x; // 1

        sA = A [bx * SIZE + ty];
        sB = B [by * SIZE + ty];

        accumResult[ty] = (sA - sB) * (sA - sB);

        // Parallel tree-reduction
        for (int stride = SIZE/2 ; stride > 0 ; stride >>= 1)
            if (ty < stride)
                accumResult[ty] += accumResult [stride + ty];

        // Writing results to output matrix
        if ((threadIdx.y == 0))
            C [bx * m + by] = accumResult[ty];
    }





Then I tried a second version, with 8 * 128 threads per block and a grid of (n/8) * (m/8) blocks:



    __global__ void EuclideanDistances2( float *A, float *B , float *C, int n , int m)
    {
        __shared__ float accumResult[SIZE][8];
        __shared__ float sA[SIZE][8];
        __shared__ float sB[SIZE][8];

        int bx = blockIdx.x;  // n / 8
        int by = blockIdx.y;  // m / 8
        int tx = threadIdx.x; // 8
        int ty = threadIdx.y; // 128
        int i = bx * tx * SIZE + ty;
        int j = by * tx * SIZE + ty;

        sA[ty][tx] = A [i];
        sB[ty][tx] = B[j];

        accumResult[ty][tx] = (sA[ty][tx] - sB[ty][tx]) * (sA[ty][tx] - sB[ty][tx]);

        // Reduction
        for (int stride = SIZE/2 ; stride > 0 ; stride>>=1)
            if (ty < stride)
                accumResult[ty][tx] += accumResult [stride + ty][tx];

        C[bx *  m + by] = accumResult[0][tx];
    }


    int main()
    {
         int m = 20000; //MatrixB size : m * SIZE
         int n = 4000;  //MatrixA size : n * SIZE


         // Host Allocations
         float *matrixA = (float *) malloc (n * SIZE * sizeof(float));
         for(int i=0; i < n * SIZE; i++)
             matrixA[i] = (float) (rand()%100)+1;

         float *matrixB = (float *) malloc (m * SIZE * sizeof(float));
         for(int i=0; i < m * SIZE; i++)
             matrixB[i] = (float) (rand()%100)+1;

         float *results_kernel1 = (float *) malloc (n * m * sizeof(float));
         float *results_kernel2 = (float *) malloc (n * m * sizeof(float));

         //Device Allocation
         float *d_matrixA;
         float *d_matrixB;
         cudaMalloc((void **)&d_matrixA, n * SIZE * sizeof(float));
         cudaMalloc((void **)&d_matrixB, m * SIZE * sizeof(float));
         cudaMemcpy(d_matrixA , matrixA , n * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
         cudaMemcpy(d_matrixB , matrixB , m * SIZE * sizeof(float) , cudaMemcpyHostToDevice);

         float *d_results_kernel1;
         float *d_results_kernel2;
         cudaMalloc((void **)&d_results_kernel1 , n * m * sizeof(float));
         cudaMalloc((void **)&d_results_kernel2 , n * m * sizeof(float));

         dim3 threads1 (1 , 128);
         dim3 blocks1  (n , m);
         EuclideanDistances1 <<<blocks1 , threads1>>> (d_matrixA , d_matrixB , d_results_kernel1 , n , m);
         cudaMemcpy(results_kernel1 , d_results_kernel1 , n * m *sizeof(float) , cudaMemcpyDeviceToHost);

         dim3 threads2 (8 , 128);   // 1024 threads per block (maximum)
         dim3 blocks2  (ceil((float)n/8) , ceil((float)m/8));
         EuclideanDistances2 <<<blocks2 , threads2>>> (d_matrixA , d_matrixB , d_results_kernel2 , n , m);
         cudaMemcpy(results_kernel2 , d_results_kernel2 , n * m *sizeof(float) , cudaMemcpyDeviceToHost);

         // Visualising and comparing results
         for (int i = 0 ; i < 50 ; i++)
             std::cout << "kernel1 : " << results_kernel1[i] << "  |  kernel2 : " << results_kernel2[i] << std::endl;


         return 0;
    }

PS: I am using CUDA 6.0 with an NVIDIA GTX 650 (compute capability 3.0).

  1. Why doesn't my second kernel work?
  2. How can I make my code run faster?



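Your second kernel had several issues:

  1. Indexing problems in the initial calculation of i and j, as well as in the index used to store the C value.
  2. Incorrect use of __syncthreads() inside a conditional block.

Item 1 was the key to getting the code to work.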
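
To make the indexing problem concrete, here is a small host-side sketch comparing the A index from the question's second kernel with the one used in the corrected kernel. The helper names (`a_index_question`, `a_index_fixed`, `row_of`, `ROWS_PER_BLOCK`) are hypothetical and exist only for this illustration. Because the question's expression multiplies `bx` by `tx`, every thread with `tx == 0` reads row 0 of A no matter which block it belongs to, and different (bx, tx) pairs can land on the same row.

```cpp
constexpr int SIZE = 128;            // vector length, as in the question
constexpr int ROWS_PER_BLOCK = 8;    // hypothetical name for blockDim.x of the second kernel

// Index into A as written in the question's second kernel.
constexpr int a_index_question(int bx, int tx, int ty) { return bx * tx * SIZE + ty; }

// Index into A as written in the corrected kernel: row (bx*8 + tx), element ty.
constexpr int a_index_fixed(int bx, int tx, int ty) { return (bx * ROWS_PER_BLOCK + tx) * SIZE + ty; }

// Row of A that a flat row-major index falls in.
constexpr int row_of(int index) { return index / SIZE; }
```

For example, block 3 with `tx == 0` should work on row 24 of A; the corrected index gives row 24, while the question's index gives row 0. The store to C has a similar flaw: `C[bx * m + by]` does not depend on `tx`, so the 8 columns of threads in a block all write to the same output element.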



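Making the code faster is more involved. First of all, your attempt at "increasing the work per thread" did nothing of the kind; it merely increased the number of threads per block (from 128 to 8 * 128). Each thread still did approximately the same amount of work. Furthermore, in moving to a 2D thread block for this attempt, I believe a couple of bad things happened:

  1. Various coalescing and shared-memory bank-conflict load and store patterns were broken.
  2. Effective occupancy went down, because of the amount of shared memory required per block.

The net effect of the second kernel was to approximately double the execution time. So that is not what we want.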
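
For a rough sense of the shared-memory cost mentioned above, the static shared memory declared by each of the question's kernels can be computed directly from the array declarations. This is a back-of-the-envelope sketch that assumes a 4-byte `float`; the function names are made up for the illustration.

```cpp
#include <cstddef>

constexpr std::size_t SIZE = 128;

// Kernel 1 declares: __shared__ float accumResult[SIZE];
constexpr std::size_t shared_bytes_kernel1() { return SIZE * sizeof(float); }

// Kernel 2 declares three arrays of [SIZE][8] floats: accumResult, sA and sB.
constexpr std::size_t shared_bytes_kernel2() { return 3 * SIZE * 8 * sizeof(float); }
```

With a 4-byte `float` this is 512 bytes per block for the first kernel and 12288 bytes per block for the second. How many blocks can then be resident on a multiprocessor depends on the device's shared-memory and thread limits, which this sketch does not model.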


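What follows is work along those lines. The code below contains your second kernel, fixed, plus timing infrastructure, full data verification, and 2 new kernels. The first new kernel (#3) is what I would call a "naive" kernel. It simply allocates one thread per output point, and each thread loops over the necessary vectors, computing its individual result. It uses no shared memory and pays little attention to coalescing or any other optimization. However, with a tweak to the thread block configuration ((16,16) -> (8,32) threads), which I observed from @talonmies' answer (now deleted), this kernel runs significantly (3x) faster than your "fast" kernel. After further thought about the (8,32) observation, I concluded that the next optimization attempt should focus on:

  1. Eliminating the parallel reduction used to compute the vector distance (i.e. letting adjacent threads use a straight for loop to traverse the vectors)
  2. Maximizing the benefit from the cache
  3. Efficient use of shared memory
  4. Insisting on perfect global coalescing and perfect use of shared memory for all reads and writes

Item 4 prompted the question in the comments, "May I transpose the matrices?" With that permission, the data can be reorganized to facilitate item 4 above. Item 2 is addressed in my "fast" kernel (#4) by loading the B vectors into shared memory, while letting the cache focus primarily on the A vectors, hopefully reducing cache thrashing (A is the smaller of the 2 vector arrays, at about 2MB; Fermi L2 is 768K, Kepler L2 is 1.5MB). By delivering A in transposed form, and effectively "transposing" B on-chip from shared memory, it is possible to use a straight for loop to compute the vector distance, while allowing adjacent threads to have perfectly coalesced reads and writes, as well as "efficient" use of shared memory (i.e. non-bank-conflicting loads and broadcast reads).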
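
The effect of the transposed layout can be sketched on the CPU. The two functions below are illustrative stand-ins (their names are assumptions, not part of the posted program): the first transposes A the way the host code does before launching kernel #4, and the second mimics that kernel's inner loops, with `tx` playing the role of a thread index. For a fixed element index `k`, consecutive values of `tx` touch consecutive addresses of the transposed A, which is the access pattern that coalesces on the GPU.

```cpp
#include <cstddef>
#include <vector>

// Transpose row-major A (n x dim) into A_T (dim x n).
std::vector<float> transpose_rows(const std::vector<float>& A, std::size_t n, std::size_t dim)
{
    std::vector<float> A_T(n * dim);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < dim; ++k)
            A_T[k * n + i] = A[i * dim + k];
    return A_T;
}

// CPU stand-in for the loops of kernel #4. The output is stored transposed:
// C_T[j * n + tx] is the squared distance between vector tx of A and vector j of B.
std::vector<float> sq_dist_transposed(const std::vector<float>& A_T,
                                      const std::vector<float>& B,
                                      std::size_t n, std::size_t m, std::size_t dim)
{
    std::vector<float> C_T(m * n, 0.0f);
    for (std::size_t j = 0; j < m; ++j) {
        for (std::size_t tx = 0; tx < n; ++tx) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < dim; ++k) {
                // Straight for loop over the vector; no parallel reduction needed.
                float d = A_T[k * n + tx] - B[j * dim + k];
                acc += d * d;
            }
            C_T[j * n + tx] = acc;
        }
    }
    return C_T;
}
```

Because the output is written as `C_T[j * n + tx]`, any comparison against a row-major reference has to swap the two indices, which is what the verification loop for kernel #4 does.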

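For my particular timing (Quadro5000 cc2.0 GPU, CUDA 6, RHEL 5.5), I see that your "fast" kernel takes about 2 seconds, my "naive" kernel takes about 0.7 seconds, and my "fast" kernel takes about 0.2 seconds, albeit with transposed (A, C) data.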



          $ cat t460.cu 
          #include <stdio.h>
          #include <stdlib.h>
          #include <iostream>
          // N must be evenly divisible by 8; M must be evenly divisible by 32 and by CHKSIZE (see the grid setups in main)
          #define SIZE 128
          #define N 4000
          #define M 20000
          #define CHKSIZE 4
          __global__ void EuclideanDistances1( float *A, float *B , float *C , int n , int m)
          {
              // SIZE is equal to 128
              __shared__ float accumResult[SIZE];
              float sA;
              float sB;
              // MAPPING
              int bx = blockIdx.x;  // n
              int by = blockIdx.y;  // m
              int ty = threadIdx.y; // 128
              //int tx = threadIdx.x; // 1
              sA = A [bx * SIZE + ty];
              sB = B [by * SIZE + ty];
              accumResult[ty] = (sA - sB) * (sA - sB);
              __syncthreads(); // all partial terms must be stored before the reduction reads them
              // Parallel tree-reduction
              for (int stride = SIZE/2 ; stride > 0 ; stride >>= 1){
                  if (ty < stride)
                      accumResult[ty] += accumResult [stride + ty];
                  __syncthreads(); // barrier outside the conditional, so every thread reaches it
              }
              // Writing results to output matrix
              if ((ty == 0))
                  C [bx * m + by] = accumResult[ty];
          }
          __global__ void EuclideanDistances2( float *A, float *B , float *C, int n , int m)
          {
              __shared__ float accumResult[SIZE][8];
              __shared__ float sA[SIZE][8];
              __shared__ float sB[SIZE][8];
              int bx = blockIdx.x;  // n / 8
              int by = blockIdx.y;  // m
              int tx = threadIdx.x; // 8
              int ty = threadIdx.y; // 128
              int i = ((bx*8) + tx) * SIZE + ty;
              int j = by * SIZE + ty;
              sA[ty][tx] = A[i];
              sB[ty][tx] = B[j];
              accumResult[ty][tx] = (sA[ty][tx] - sB[ty][tx]) * (sA[ty][tx] - sB[ty][tx]);
              __syncthreads(); // all partial terms must be stored before the reduction reads them
              // Reduction
              for (int stride = SIZE/2 ; stride > 0 ; stride>>=1){
                  if (ty < stride)
                      accumResult[ty][tx] += accumResult [stride + ty][tx];
                  __syncthreads(); // barrier outside the conditional, so every thread reaches it
              }
              if (ty == 0)
                  C[((bx*8)+tx) *  m + by] = accumResult[0][tx];
          }
          //naive kernel
          __global__ void EuclideanDistances3( float *A, float *B , float *C, int n , int m){
            int idx = threadIdx.x+blockDim.x*blockIdx.x;
            int idy = threadIdx.y+blockDim.y*blockIdx.y;
            float result = 0.0f;
            if ((idx < n) && (idy < m)){
              for (int i = 0; i < SIZE; i++){
                float temp = A[(idx*SIZE)+i] - B[(idy*SIZE)+i];
                result += temp * temp;}
              C[(idx*m) + idy] = result;
            }
          }
          //optimized kernel
          __global__ void EuclideanDistances4( const float *A, const float *B , float *C, const int n , const int m){
            // n, A,  4000 this kernel assumes A is column-major A(SIZE, n)
            // m, B, 20000 this kernel assumes B is row-major    B(m, SIZE)
            // this kernel assumes C is column-major             C(m,n)
            // this kernel assumes number of threads per threadblock == SIZE
            // CHKSIZE is the number of B vectors that will be computed per block
            __shared__ float my_sB[CHKSIZE*SIZE];  // enough shared storage for CHKSIZE vectors of B
            int bx  = blockIdx.x; // one block per CHKSIZE rows of B (the larger input matrix)
            while ((bx*CHKSIZE) < m){ // not used, this while loop could be used to extend a block to multiple chunks
              int tx  = threadIdx.x;
              for (int i = 0; i < CHKSIZE; i++)  // load vectors of B into shared memory
                my_sB[(i*SIZE)+tx] = B[(((bx*CHKSIZE)+i)*SIZE)+tx];
              __syncthreads(); // the shared B vectors must be fully loaded before any thread reads them
              while (tx < n){  //loop across all vectors in A
                float result[CHKSIZE];
                for (int i = 0; i < CHKSIZE; i++)
                  result[i] = 0.0f;
                for (int i = 0; i < SIZE; i++){
                  float Atemp = A[(n*i)+tx];
                  for (int j = 0; j < CHKSIZE; j++){ // compute all CHKSIZE B vectors with read of A
                    float temp = Atemp - my_sB[i + (j*SIZE)];
                    result[j] += temp * temp;}}
                for (int i = 0; i < CHKSIZE; i++) // store CHKSIZE results
                  C[((i+(bx*CHKSIZE))*n)+ tx] = result[i];
                tx += blockDim.x;  } // continue looping across vectors in A
              __syncthreads(); // necessary to prevent warps from racing ahead, if block looping is used
              bx += gridDim.x;}
          }
          // CPU reference: squared Euclidean distance between two vectors of length size
          float comp_euclid_sq(const float *rA, const float *rB, const int size){
            float result = 0.0f;
            float temp;
            for (int i = 0; i < size; i++){
              temp = (rA[i] - rB[i]);
              result += temp * temp;}
            return result;
          }
          int main()
          {
               float et1=0.0f, et2=0.0f, et3=0.0f, et4=0.0f;
               cudaEvent_t start1, start2, start3,start4, stop1, stop2, stop3, stop4;
               // events must be created before they can be recorded or timed
               cudaEventCreate(&start1); cudaEventCreate(&stop1);
               cudaEventCreate(&start2); cudaEventCreate(&stop2);
               cudaEventCreate(&start3); cudaEventCreate(&stop3);
               cudaEventCreate(&start4); cudaEventCreate(&stop4);
               int n = N;  //MatrixA size : n * SIZE
               int m = M; //MatrixB size : m * SIZE
               // Host Allocations
               float *matrixA = (float *) malloc (n * SIZE * sizeof(float));
               for(int i=0; i < n * SIZE; i++)
                   matrixA[i] = (float) (rand()%100)+1;
               float *matrixB = (float *) malloc (m * SIZE * sizeof(float));
               for(int i=0; i < m * SIZE; i++)
                   matrixB[i] = (float) (rand()%100)+1;
               float *results_kernel = (float *) malloc (n * m * sizeof(float));
               float *cpu_results_kernel = (float *) malloc (n * m * sizeof(float));
               for (int i = 0; i< n*m; i++)
                 cpu_results_kernel[i] = comp_euclid_sq(matrixA + ((i/m)*SIZE), matrixB + (i%m)*SIZE, SIZE);
               //Device Allocation
               float *d_matrixA;
               float *d_matrixB;
               cudaMalloc((void **)&d_matrixA, n * SIZE * sizeof(float));
               cudaMalloc((void **)&d_matrixB, m * SIZE * sizeof(float));
               cudaMemcpy(d_matrixA , matrixA , n * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
               cudaMemcpy(d_matrixB , matrixB , m * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
               float *d_results_kernel;
               cudaMalloc((void **)&d_results_kernel , n * m * sizeof(float));
               dim3 threads1 (1 , SIZE);
               dim3 blocks1  (n , m);
               cudaEventRecord(start1);
               EuclideanDistances1 <<<blocks1 , threads1>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
               cudaEventRecord(stop1);
               cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
               for (int i = 0; i< n*m; i++) {
                 if (results_kernel[i] != cpu_results_kernel[i])  {printf("cpu/kernel1 mismatch at %d, cpu: %f, kernel1: %f\n", i, cpu_results_kernel[i], results_kernel[i]); return 1;}}
               cudaMemset(d_results_kernel, 0, n*m*sizeof(float));
               cudaEventElapsedTime(&et1, start1, stop1);
               dim3 threads2 (8 , SIZE);   // 1024 threads per block (maximum)
               dim3 blocks2  (n/8 , m); // assumes n evenly divisible by 8
               cudaEventRecord(start2);
               EuclideanDistances2 <<<blocks2 , threads2>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
               cudaEventRecord(stop2);
               cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
               for (int i = 0; i< n*m; i++) {
                 if (results_kernel[i] != cpu_results_kernel[i])  {printf("cpu/kernel2 mismatch at %d, cpu: %f, kernel2: %f\n", i, cpu_results_kernel[i], results_kernel[i]); return 1;}}
               cudaMemset(d_results_kernel, 0, n*m*sizeof(float));
               cudaEventElapsedTime(&et2, start2, stop2);
               cudaFuncSetCacheConfig(EuclideanDistances3, cudaFuncCachePreferL1);
               dim3 threads3 (8, 32);   // 256 threads per block
               dim3 blocks3  (n/threads3.x , m/threads3.y); // assumes evenly divisible
               cudaEventRecord(start3);
               EuclideanDistances3 <<<blocks3 , threads3>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
               cudaEventRecord(stop3);
               cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
               for (int i = 0; i< n*m; i++) {
                 if (results_kernel[i] != cpu_results_kernel[i])  {printf("cpu/kernel3 mismatch at %d, cpu: %f, kernel3: %f\n", i, cpu_results_kernel[i], results_kernel[i]); return 1;}}
               cudaMemset(d_results_kernel, 0, n*m*sizeof(float));
               cudaEventElapsedTime(&et3, start3, stop3);
               // transpose matrix A
               float *matrixA_T = (float *) malloc (n * SIZE * sizeof(float));
                 for (int i = 0; i < n; i++)
                   for (int j = 0; j < SIZE; j++)
                     matrixA_T[(j*n)+i] = matrixA[(i*SIZE)+j];
               cudaMemcpy(d_matrixA , matrixA_T , n * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
               cudaFuncSetCacheConfig(EuclideanDistances4, cudaFuncCachePreferL1);
               dim3 threads4(SIZE); // one thread per vector element
               dim3 blocks4(m/CHKSIZE);
               cudaEventRecord(start4);
               EuclideanDistances4 <<<blocks4 , threads4>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
               cudaEventRecord(stop4);
               cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
               // test for correct transposed result C(m,n)
               for (int i = 0; i< n; i++)
                 for (int j = 0; j < m; j++)
                   if (results_kernel[(j*n)+i] != cpu_results_kernel[(i*m)+j])  {printf("cpu/kernel4 mismatch at %d,%d, cpu: %f, kernel4: %f\n", i,j, cpu_results_kernel[(i*m)+j], results_kernel[(j*n)+i]); return 1;}
               cudaEventElapsedTime(&et4, start4, stop4);
               printf("kernel1 : %.fms, kernel2 : %.fms, kernel3 : %.fms, kernel4 : %.fms\n", et1, et2, et3, et4);
               return 0;
          }
          $ nvcc -O3 -arch=sm_20 -o t460 t460.cu
          $ ./t460
          kernel1 : 2213ms, kernel2 : 4660ms, kernel3 : 691ms, kernel4 : 99ms



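I suppose I should also mention that, following the lead of the code you provided, this code computes the square of the Euclidean distance. A trivial modification to the kernels can make it compute the actual Euclidean distance instead (C[...] = sqrtf(...);). However, the validation I have included assumes that the results are "in range" for perfect storage of integers in a float. Your test case satisfies this requirement, but otherwise the validation code would need to be modified (if sqrtf were used).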
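
The "in range" condition can be checked with a little arithmetic. The sketch below assumes IEEE-754 single precision, where a float has a 24-bit significand and therefore represents every integer up to 2^24 exactly; the helper names are illustrative only. The test data is `rand()%100 + 1`, so elements lie in 1..100, two elements differ by at most 99, and the largest possible squared distance over 128 elements is 128 * 99 * 99.

```cpp
// Largest squared distance for vectors of length dim whose elements
// differ by at most max_abs_diff (both parameter names are illustrative).
constexpr long long max_sq_dist(long long dim, long long max_abs_diff)
{
    return dim * max_abs_diff * max_abs_diff;
}

// Integers up to 2^24 are exactly representable in an IEEE-754 float.
constexpr long long FLOAT_EXACT_INT_LIMIT = 1LL << 24;

// True when every partial sum is an integer that a float stores exactly,
// so comparing CPU and GPU results with != is meaningful.
constexpr bool exact_compare_is_safe(long long dim, long long max_abs_diff)
{
    return max_sq_dist(dim, max_abs_diff) <= FLOAT_EXACT_INT_LIMIT;
}
```

For the test case here the bound is 1254528, well below 16777216, so every partial sum is an exactly representable integer and the order of additions on the GPU does not change the result. Once sqrtf is applied, the results are generally no longer integers, and a tolerance-based comparison would be the usual replacement for the exact != test.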