Question

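I'm using OpenCV for a computer vision application. I would like to accelerate some matrix operations (the matrices are fairly large) on the GPU and want to avoid coding directly in CUDA C, if possible. OpenCV 2.4.1 has a number of GPU-accelerated functions. How well do they perform in your experience? Am I better off using another library (e.g., Thrust) instead?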

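EDIT Sample application: Calculate squared Euclidean distance matrix on GPU. Currently, my GPU-accelerated (and vectorized) implementation in Matlab using the Parallel Computing Toolbox (PCT) is about 5-10 times faster than my C++ implementation with OpenCV.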

Matlab implementation:

function K = sqEuclideanDist(P_cpu,Q_cpu)
% Vectorized method to compute pairwise squared Euclidean distance on GPU
% Returns K(i,j) = (P(i,:) - Q(j,:))*(P(i,:) - Q(j,:))'

P_gpu = gpuArray(P_cpu);
Q_gpu = gpuArray(Q_cpu);

[nP, d] = size(P_gpu);
[nQ, d] = size(Q_gpu);

pmag = sum(P_gpu .* P_gpu, 2);
qmag = sum(Q_gpu .* Q_gpu, 2);

% note that K is on GPU
K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P_gpu*Q_gpu';

end
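
The vectorized expression rests on expanding the squared norm, which may be worth spelling out. Here p_i and q_j denote the i-th row of P and the j-th row of Q, and the 1 symbols are all-ones column vectors; this notation is introduced for illustration and is not part of the original code.

```latex
\[
K_{ij} = \lVert p_i - q_j \rVert^2
       = \lVert p_i \rVert^2 + \lVert q_j \rVert^2 - 2\, p_i q_j^{\top}
\]
\[
K = \mathrm{pmag}\,\mathbf{1}_{n_Q}^{\top}
  + \mathbf{1}_{n_P}\,\mathrm{qmag}^{\top}
  - 2\, P Q^{\top}
\]
```

The second line is the matrix form of the first: the two outer products with the ones vectors replicate the row norms across the n_P-by-n_Q result, and the single matrix product supplies all the cross terms at once.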

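UPDATE Here is another Matlab implementation that accomplishes the same thing (thanks to https://stackoverflow.com/a/7774323/1121420). But it runs only on the CPU because bsxfun is not supported by PCT. Still looking for a C++ alternative though.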

function K = sqEuclideanDist(P_cpu,Q_cpu)
% Returns K(i,j) = (P(i,:) - Q(j,:))*(P(i,:) - Q(j,:))'
% Runs on CPU only.

K = bsxfun(@plus,sum(P_cpu.^2,2),sum(Q_cpu.^2,2)') - 2*(P_cpu*Q_cpu');

end
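
For anyone checking a GPU port, a plain CPU version of the same computation in C++ can serve as a reference to compare results against. The following is only a minimal sketch using the standard library: the `Matrix` struct and its row-major layout are assumptions made for this example, not part of OpenCV or any other library mentioned here, and no attempt is made at performance.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major dense matrix. Hypothetical helper type for this sketch only.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// K(i,j) = |P(i,:)|^2 + |Q(j,:)|^2 - 2 * dot(P(i,:), Q(j,:)),
// the same expansion the Matlab code uses.
Matrix sqEuclideanDist(const Matrix& P, const Matrix& Q) {
    assert(P.cols == Q.cols);  // both point sets must share the dimension d

    // Squared norm of every row of P and of Q.
    std::vector<double> pmag(P.rows, 0.0), qmag(Q.rows, 0.0);
    for (std::size_t i = 0; i < P.rows; ++i)
        for (std::size_t k = 0; k < P.cols; ++k)
            pmag[i] += P(i, k) * P(i, k);
    for (std::size_t j = 0; j < Q.rows; ++j)
        for (std::size_t k = 0; k < Q.cols; ++k)
            qmag[j] += Q(j, k) * Q(j, k);

    // Combine the norms with the cross term P * Q'.
    Matrix K(P.rows, Q.rows);
    for (std::size_t i = 0; i < P.rows; ++i) {
        for (std::size_t j = 0; j < Q.rows; ++j) {
            double dot = 0.0;
            for (std::size_t k = 0; k < P.cols; ++k)
                dot += P(i, k) * Q(j, k);
            K(i, j) = pmag[i] + qmag[j] - 2.0 * dot;
        }
    }
    return K;
}
```

Because the expansion subtracts nearly equal quantities when two points are close together, small negative values can appear from rounding, so comparisons against a GPU result are probably best done with a tolerance.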

Answer 1

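I find ArrayFire to be much faster and have started using it instead of the GPU kernels in OpenCV for image processing. Here are some benchmarks I found comparing ArrayFire (which used to be a different interface called LibJacket) to OpenCV, and it has been true in my own benchmarking too that ArrayFire is 2-4X faster than the GPU functions in OpenCV. From what I hear, NVIDIA didn't write the GPU kernels in OpenCV but contracted them out to someone, which may be why they are so slow. Since I'm only using 1 GPU, I can use ArrayFire for free.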

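Update, given the new MATLAB code posted by @Alex: I ran a benchmark of this code on my system. I find that Parallel Computing Toolbox gpuArray is slower than the CPU, but Jacket and ArrayFire kick butt. Hardware specs are: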

Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
NVIDIA Tesla M2090

Results of CPU vs GPU using Parallel Computing Toolbox gpuArray (fully warmed up). The CPU is faster than PCT gpuArray:

>> tic; sqEuclideanDist(gpuArray(rand(1581,3)),gpuArray(rand(189,3))); toc;
Elapsed time is 0.006859 seconds.
>> tic; sqEuclideanDist(rand(1581,3),rand(189,3)); toc;
Elapsed time is 0.005712 seconds.

Results of CPU vs GPU using Jacket (fully warmed up). Jacket beats PCT gpuArray by 3.7X and beats the CPU by 3X:

>> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
Elapsed time is 0.001876 seconds.

Here is the modified code that lets you run all of that easily:

function K = sqEuclideanDist(P,Q)
% Vectorized method to compute pairwise squared Euclidean distance on GPU
% Returns K(i,j) = (P(i,:) - Q(j,:))*(P(i,:) - Q(j,:))'

[nP, d] = size(P);
[nQ, d] = size(Q);

pmag = sum(P .* P, 2);
qmag = sum(Q .* Q, 2);

K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P*Q';

end

Jacket does support BSXFUN on the GPU, and it does improve the speed somewhat:

>> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
Elapsed time is 0.001420 seconds.

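Note that the sizes used here are pretty small, so most CUDA code that attempts to run on such small sizes is likely to perform poorly. That's why I like to use AccelerEyes' stuff, because those guys have heavily optimized for the GPU, unlike PCT gpuArray, Thrust, and OpenCV, each of which I've tried in the past.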
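
At sizes this small a single tic/toc measurement is noisy, and the MATLAB commands above also time the rand calls and the host-to-device copies, since those sit inside the tic/toc pair. One way to get steadier numbers is to run a few untimed warm-up calls and then average over many timed ones. Here is a minimal sketch of that pattern in standard C++; `averageSeconds` is a hypothetical helper name, and for GPU work the callable is assumed to block until the device has finished (for example by synchronizing), otherwise only the launch cost would be measured.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average wall-clock seconds per call of `work`, measured over `iters`
// calls after `warmup` untimed calls. Hypothetical helper for illustration.
double averageSeconds(const std::function<void()>& work, int warmup, int iters) {
    // Untimed calls: let one-time costs (allocation, JIT, caches) settle.
    for (int i = 0; i < warmup; ++i) {
        work();
    }

    // Timed calls: steady_clock is monotonic, so it is safe for intervals.
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count() / iters;
}
```

Creating the inputs and moving them to the device before calling the helper keeps data generation and transfer out of the measurement, if the goal is to compare the computation alone.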

Here are the ArrayFire Free C++ results:

Time:  0.0003577 seconds
Speedups:  19.2X faster than PCT gpuArray, 16X faster than the CPU, 5.2X faster
than Jacket in MATLAB original version, 4X faster than Jacket in MATLAB using
BSXFUN
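
The speedup figures are simply ratios of the elapsed times reported earlier in this answer to the ArrayFire time. As a quick check of the arithmetic, here is a small sketch in C++; the function names are made up for this example and the constants are copied from the timings above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Speedup of a candidate over a baseline: baseline time / candidate time.
double speedup(double baselineSeconds, double candidateSeconds) {
    return baselineSeconds / candidateSeconds;
}

// Prints each reported timing relative to the ArrayFire C++ timing.
void printSpeedups() {
    const double arrayfire = 0.0003577;  // ArrayFire Free C++
    std::printf("vs PCT gpuArray:     %.1fX\n", speedup(0.006859, arrayfire));
    std::printf("vs CPU:              %.1fX\n", speedup(0.005712, arrayfire));
    std::printf("vs Jacket:           %.1fX\n", speedup(0.001876, arrayfire));
    std::printf("vs Jacket (BSXFUN):  %.1fX\n", speedup(0.001420, arrayfire));
}
```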

Here is the ArrayFire code I wrote for this:

#include <stdio.h>
#include <arrayfire.h>

using namespace af;

static array SqEuclideanDist(array P, array Q)
{
    // 0 based indexing
    array pmag = sum(P * P, 1);
    array qmag = sum(Q * Q, 1);

    int np = P.dims(0);
    int nq = Q.dims(0);

    // Sum (not product) of the tiled norms, matching the Matlab version
    array K = tile(qmag.T(), np, 1) + tile(pmag, 1, nq) - 2 * matmul(P, Q.T());
    return K;
}

int main(int argc, char **argv)
{
    // Zero-initialized host buffers; the values do not matter for timing
    double *P_cpu = new double[1581 * 3]();
    double *Q_cpu = new double[189 * 3]();

    array P = array(1581, 3, P_cpu);
    array Q = array(189 , 3, Q_cpu);
    af::sync();

    int iter = 1000;

    timer::tic();
    for (int i = 0; i < iter; i++) {
        array K = SqEuclideanDist(P, Q);
        af::eval(K);
    }

    af::sync();
    printf("Time taken: %2.4lfms\n", (1000 * timer::toc()) / iter);

    delete[] P_cpu;
    delete[] Q_cpu;
}

Answer 2

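They are contributed by NVidia, so they do have good performance on CUDA-compatible cards. The real performance depends on the card itself and the function you are using.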

In my experience, only cvRotate and cvResize performed better than a normal Intel CPU. (Note: I was only interested in image-related functions.)

How good is the OpenCV GPU library for matrix operations?
