Question

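I want to compute the pairwise distances between two submatrices of a matrix. For example, I have a matrix A (MxN) and two blocks of that matrix, B1 (mxn) and B2 (kxt). More specifically, I want to compute the distance of the B1(1,1) element from all the elements of B2, and repeat this for every element of B1. To be clearer, B1 and B2 may not be compact parts of the matrix; basically, the information I have is the coordinates of the elements of B1 and B2 on matrix A. Here is an example.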

for(int i = 0; i < nRowsCoordsB1 ; i++ ){//nRows of B1
  for(int j = 0; j < nRowsCoordsB2 ; j++ ){//nRows of B2

    //CoordsofB1 is a nRowsB1x2 matrix that contains the element coordinates of the B1 sub matrix

    a_x = CoordsofB1[ i ]; //take the x coord of the corresponding row i
    a_y = CoordsofB1[ i + nRowsCoordsB1 ]; //take the y coord of the corresponding row

    b_x = CoordsofB2[ j ];
    b_y = CoordsofB2[ j + nRowsCoordsB2 ];

    int element1 = A[ a_x + a_y*nRowsofA ];
    int element2 = A[ b_x + b_y*nRowsofA ] ;
    sum +=abs( element1 - element2 ) ;

  }
}
*Output = sum/(float)(numberOfElementsofB1*numberOfElementsofB2);

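Now I want to speed up the computation with CUDA :) Since I am new to the CUDA way of thinking, I find it a bit complicated. So far I think I have understood the logic of assigning block threads at the matrix level, but the fact that I have two parts of the matrix with different sizes, CoordsofB1 and CoordsofB2, confuses me about how to access their coordinates and use them in the whole matrix. I thought we should work in A using constraints, but I have no clear idea.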

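Also, the fact that the sum is divided by a quantity at the end of the for loops confuses me about how to incorporate that step in the CUDA translation of the code.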

Any suggestions, snippets, examples, or references would be great.

PS: The reason I use column-major ordering is that the code is evaluated in MATLAB.

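UPDATE: Can we allocate thread blocks of size equal to the size of the larger submatrix, B1 or B2, and work with them using the correct conditions? I commented out the last line because I am not sure how to deal with it. Any comments?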

int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
if( r < nRowsCoordsB1 ){       

  a_x = CoordsofB1[ r ]; 
  a_y = CoordsofB1[ r + nRowsCoordsB1 ]; 
  if( r < nRowsCoordsB2 ){

    b_x = CoordsofB2[ r ];
    b_y = CoordsofB2[ r + nRowsCoordsB2 ];
    int element1 = A[ a_x + a_y*nRowsofA ];
    int element2 = A[ b_x + b_y*nRowsofA ] ;
    sum +=abs( element1 - element2 ) ;

  }
}
//*Output = sum/(float)(numberOfElementsofB1*numberOfElementsofB2);

Here is a sketch: [image]

I have the coordinates of each element inside B1 and B2, and I want to compute the differences between the values:

[ (B1(1,1) - B2(1,1)) + (B1(1,1) - B2(1,2)) + ... + (B1(1,1) - B2(:,:)) ] +

[ (B1(1,2) - B2(1,1)) + (B1(1,2) - B2(1,2)) + ... + (B1(1,2) - B2(:,:)) ] +

[ (B1(:,:) - B2(1,1)) + (B1(:,:) - B2(1,2)) + ... + (B1(:,:) - B2(:,:)) ].

Answer 1

If I understand correctly, what you are trying to do can be written with the following MATLAB code.

rep_B1 = repmat(B1(:),  1, length(B2(:)) );
rep_B2 = repmat(B2(:)', length(B1(:)), 1 );
absdiff_B1B2 = abs(rep_B1 - rep_B2);
Result = mean( absdiff_B1B2(:) );

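You will notice that before the reduction there is a matrix absdiff_B1B2 of size length(B1(:)) x length(B2(:)), i.e. m*n x k*t (this matrix is never stored to global memory if the code above is implemented in a single CUDA kernel). You can partition this matrix into 16x16 sub-matrices and use one 256-thread block per sub-matrix to distribute the workload across the GPU.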
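The tiling described above can be sketched roughly as follows. This is a minimal, untested-on-your-hardware sketch, not the answerer's code: the names abs_diff_sum, mean_abs_diff and TILE are hypothetical, coordinates are assumed to be zero-based and stored as in the question (all x values first, then all y values), the values of A are assumed small enough that the differences do not overflow, and error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // 16x16 tile of the pair matrix -> 256 threads per block

// Each block handles one TILE x TILE tile of the (nB1 x nB2) pair matrix,
// sums its absolute differences in shared memory, and adds the tile sum
// to a single global accumulator. The pair matrix itself is never stored.
__global__ void abs_diff_sum(const int* A, int nRowsofA,
                             const int* CoordsofB1, int nB1,
                             const int* CoordsofB2, int nB2,
                             unsigned long long* total)
{
    __shared__ unsigned int partial[TILE * TILE];

    int i = blockIdx.x * TILE + threadIdx.x;     // index into the B1 coordinates
    int j = blockIdx.y * TILE + threadIdx.y;     // index into the B2 coordinates
    int tid = threadIdx.y * TILE + threadIdx.x;  // linear thread id in the block

    unsigned int d = 0;                          // out-of-range threads contribute 0
    if (i < nB1 && j < nB2) {
        int e1 = A[CoordsofB1[i] + CoordsofB1[i + nB1] * nRowsofA];  // column-major
        int e2 = A[CoordsofB2[j] + CoordsofB2[j + nB2] * nRowsofA];
        d = (unsigned int)abs(e1 - e2);
    }
    partial[tid] = d;
    __syncthreads();

    // Tree reduction over the 256 values of this tile.
    for (int s = TILE * TILE / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(total, (unsigned long long)partial[0]);
}

// Hypothetical host wrapper: copies the inputs, launches the kernel,
// and performs the final division on the CPU.
float mean_abs_diff(const std::vector<int>& A, int nRowsofA,
                    const std::vector<int>& c1, const std::vector<int>& c2)
{
    assert(c1.size() % 2 == 0 && c2.size() % 2 == 0);
    int nB1 = (int)c1.size() / 2, nB2 = (int)c2.size() / 2;

    int *dA, *d1, *d2;
    unsigned long long *dTotal, total = 0;
    cudaMalloc(&dA, A.size() * sizeof(int));
    cudaMalloc(&d1, c1.size() * sizeof(int));
    cudaMalloc(&d2, c2.size() * sizeof(int));
    cudaMalloc(&dTotal, sizeof(unsigned long long));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d1, c1.data(), c1.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d2, c2.data(), c2.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(unsigned long long));

    dim3 block(TILE, TILE);
    dim3 grid((nB1 + TILE - 1) / TILE, (nB2 + TILE - 1) / TILE);
    abs_diff_sum<<<grid, block>>>(dA, nRowsofA, d1, nB1, d2, nB2, dTotal);

    cudaMemcpy(&total, dTotal, sizeof(total), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(d1); cudaFree(d2); cudaFree(dTotal);
    return (float)total / ((float)nB1 * (float)nB2);
}
```

Because the kernel reads A through the coordinate arrays, B1 and B2 do not need to be compact blocks. The single atomicAdd per block keeps contention low; whether this beats a separate reduction pass would have to be measured.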

On the other hand, you could use Thrust to make your life easier.

Update

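Since B1 and B2 are submatrices of A, you can first use cudaMemcpy2D() to copy them into linear memory, then use a kernel to construct and then reduce the matrix absdiff_B1B2.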
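As a rough illustration of the cudaMemcpy2D() step, the sketch below copies one rectangular block of a column-major host matrix into a dense device buffer. It only applies when the submatrix really is a contiguous rectangle, which the question says may not hold. The helper name copy_block_to_device and its parameters (row0, col0, m, n, all zero-based) are hypothetical, and error checking is omitted.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Copy the m x n block of the column-major host matrix A (nRowsofA rows)
// whose top-left element is (row0, col0) into a dense, column-major
// device buffer of m*n ints. The caller owns the returned pointer.
int* copy_block_to_device(const int* A, int nRowsofA,
                          int row0, int col0, int m, int n)
{
    assert(row0 >= 0 && col0 >= 0 && row0 + m <= nRowsofA);

    int* dB = nullptr;
    cudaMalloc(&dB, (size_t)m * n * sizeof(int));

    // In column-major storage each block column is m contiguous ints,
    // and consecutive columns of A are nRowsofA ints apart.
    cudaMemcpy2D(dB,                                  // destination
                 m * sizeof(int),                     // destination pitch (dense)
                 A + row0 + (size_t)col0 * nRowsofA,  // first element of the block
                 nRowsofA * sizeof(int),              // source pitch: one column of A
                 m * sizeof(int),                     // bytes copied per block column
                 n,                                   // number of block columns
                 cudaMemcpyHostToDevice);
    return dB;
}
```

A call such as copy_block_to_device(A, M, r0, c0, m, n) would then yield a linear B1 that a kernel or Thrust can index directly.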

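For the final normalization (the last line of your code), you can do it on the CPU.

Here is code that uses Thrust to show how to construct and reduce the matrix absdiff_B1B2 in a single kernel. However, you will find that the construction step does not use shared memory and is not optimized. Further optimization using shared memory would improve performance.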

#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/inner_product.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>

template<typename T>
struct abs_diff
{
    inline __host__ __device__ T operator()(const T& x, const T& y)
    {
        return abs(x - y);
    }
};

int main()
{
    using namespace thrust::placeholders;

    const int m = 98;
    const int n = 87;
    int k = 76;
    int t = 65;
    double result;

    thrust::device_vector<double> B1(m * n, 1.0);
    thrust::device_vector<double> B2(k * t, 2.0);

    result = thrust::inner_product(
            thrust::make_permutation_iterator(
                    B1.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 % (m * n))),
            thrust::make_permutation_iterator(
                    B1.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 % (m * n))) + (m * n * k * t),
            thrust::make_permutation_iterator(
                    B2.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 / (m * n))),
            0.0,
            thrust::plus<double>(),
            abs_diff<double>());
    result /= m * n * k * t;

    std::cout << result << std::endl;

    return 0;
}

Answer 2

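Perhaps the following solution, which uses a 2D thread grid, could be an alternative to Eric's use of Thrust and give a bit more insight into the problem.

The code snippet below is only meant to illustrate the concept. It is untested code.

2D grid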

Define a partial_distances matrix of size nRowsCoordsB1 X nRowsCoordsB2, containing all the relevant absolute-value differences between the elements of B1 and B2. In the main file you will have

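// A, CoordsofB1 and CoordsofB2 are assumed to be device pointers that have
// already been allocated and filled; A and nRowsofA are passed explicitly
// because the kernel cannot see host variables.
__global__ void distance_calculator(int* partial_distances, const int* A, int nRowsofA,
                                    const int* CoordsofB1, const int* CoordsofB2,
                                    int nRowsCoordsB1, int nRowsCoordsB2)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;

    // Threads of the last blocks may fall outside the pair matrix.
    if (i >= nRowsCoordsB1 || j >= nRowsCoordsB2) return;

    int a_x = CoordsofB1[i];
    int a_y = CoordsofB1[i + nRowsCoordsB1];

    int b_x = CoordsofB2[j];
    int b_y = CoordsofB2[j + nRowsCoordsB2];

    partial_distances[j * nRowsCoordsB1 + i] = abs(A[a_x + a_y * nRowsofA] - A[b_x + b_y * nRowsofA]);
}

int iDivUp(int a, int b) { return (a % b != 0) ? (a / b + 1) : (a / b); }

#define BLOCKSIZE 32

int main()
{
    int* partial_distances;
    cudaMalloc((void**)&partial_distances, nRowsCoordsB1 * nRowsCoordsB2 * sizeof(int));

    dim3 BlockSize(BLOCKSIZE, BLOCKSIZE);
    dim3 GridSize;
    GridSize.x = iDivUp(nRowsCoordsB1, BLOCKSIZE);
    GridSize.y = iDivUp(nRowsCoordsB2, BLOCKSIZE);

    distance_calculator<<<GridSize, BlockSize>>>(partial_distances, A, nRowsofA,
                                                 CoordsofB1, CoordsofB2,
                                                 nRowsCoordsB1, nRowsCoordsB2);

    REDUCTION_STEP
}

REDUCTION_STEP can be implemented as iterative calls to a 1D reduction kernel that sums up all the elements corresponding to a specific element of B1.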
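One possible shape for REDUCTION_STEP is sketched below: a block-level sum kernel launched repeatedly until a single value remains. Note that it sums the whole partial_distances array, which is what the question's final division needs, rather than producing one sum per B1 element. The names block_sum, reduce_sum_device and RBLOCK are hypothetical, the total is assumed to fit in an int, the input buffer is overwritten, and error checking is omitted.

```cuda
#include <cassert>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

#define RBLOCK 256  // threads per block; must be a power of two

// Each block sums up to RBLOCK consecutive inputs and writes one partial sum.
__global__ void block_sum(const int* in, int* out, int n)
{
    __shared__ int s[RBLOCK];
    int tid = threadIdx.x;
    int idx = blockIdx.x * RBLOCK + tid;

    s[tid] = (idx < n) ? in[idx] : 0;  // pad the last block with zeros
    __syncthreads();

    for (int stride = RBLOCK / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

// Sums n ints stored in device memory at d_in by launching block_sum
// repeatedly, ping-ponging between d_in and a scratch buffer.
// The contents of d_in are destroyed.
int reduce_sum_device(int* d_in, int n)
{
    assert(n > 0);
    int* d_tmp = nullptr;
    cudaMalloc(&d_tmp, ((n + RBLOCK - 1) / RBLOCK) * sizeof(int));

    int* src = d_in;
    int* dst = d_tmp;
    while (n > 1) {
        int blocks = (n + RBLOCK - 1) / RBLOCK;
        block_sum<<<blocks, RBLOCK>>>(src, dst, n);
        std::swap(src, dst);  // the partial sums become the next input
        n = blocks;
    }

    int result = 0;
    cudaMemcpy(&result, src, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_tmp);
    return result;
}
```

With this helper, REDUCTION_STEP could read `int sum = reduce_sum_device(partial_distances, nRowsCoordsB1 * nRowsCoordsB2);`, followed by the division on the host.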

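An alternative would be to use dynamic parallelism to call the reduction routine directly inside the kernel, but this option is not suitable for the card you are using.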
