Question

I have GPU-accelerated MATLAB code that spends 80%-90% of its time computing

sum(a.*exp(b.*c),1)

where

size(a) = [n 1]
size(b) = [n 1]
size(c) = [1 m]

n can be chosen to be any size (within memory limits), and

5000 < m < 20000

I would like to speed this up beyond what gpuArrays already give me (about 17x in double precision).

Benchmarking

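Using MATLAB 2018b and an NVIDIA P100 GPU, I ran the following script, which is meant to find the optimal size of n. It shows that in double precision I get a 17x speedup over the CPU (dual-socket Intel Xeon E5-2650v2). Can I improve on this by doing something more advanced, such as using GPU Coder, or even shared memory or texture memory as described here? https://uk.mathworks.com/help/parallel-computing/examples/accessing-advanced-cuda-features-using-mex.html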

%% Optimisation MWE

nVec = 1000:1000:60000; % Vector of candidate n values
m = 5000;

fig = figure(1); % named fig so the timing handle f1 below does not overwrite it
ax(1) = subplot(3,1,1);
ax(2) = subplot(3,1,2);
ax(3) = subplot(3,1,3);

% Preallocate time outputs
t = nan(length(nVec),3);
speedupGPU = nan(length(nVec),2);

% Loop over candidate n values
for n = 1:length(nVec)

    %% CPU code
    a = rand(nVec(n),1);
    b = rand(nVec(n),1);
    c = rand(1,m);

    f1 = @() sum(a.*exp(b.*c),1);

    t(n,1) = timeit(f1,1);

    %% GPU code (double precision)
    a = gpuArray(a);
    b = gpuArray(b);
    c = gpuArray(c);

    f2 = @() sum(a.*exp(b.*c),1);

    t(n,2) = gputimeit(f2);

    %% GPU code (single precision)
    a = single(a);
    b = single(b);
    c = single(c);

    f3 = @() sum(a.*exp(b.*c),1);

    t(n,3) = gputimeit(f3);

    %% Calculate speedup
    speedupGPU(n,1) = t(n,1)/t(n,2);
    speedupGPU(n,2) = t(n,1)/t(n,3);

    %% Plot
    plot(ax(1),nVec,t,'.-')             % Plot compute time
    plot(ax(2),nVec,t./nVec(:),'.-')    % Plot normalised compute time
    plot(ax(3),nVec,speedupGPU,'.-')    % Plot Speedup

    %% Label plots
    xlabel(ax(1),'n')
    ylabel(ax(1),'Time')
    legend(ax(1),'CPU','GPU double','GPU single')

    xlabel(ax(2),'n')
    ylabel(ax(2),'Normalised Time')
    legend(ax(2),'CPU','GPU double','GPU single')

    xlabel(ax(3),'n')
    ylabel(ax(3),'Speedup')
    legend(ax(3),'CPU/GPU double','CPU/GPU single')

    drawnow

end

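The results are shown in the figure below (top: execution time as n increases (smaller is better); middle: execution time normalised by n (smaller is better); bottom: speedup relative to the CPU (larger is better)):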

Answer 1

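I realise this may not give you the speedup you are looking for, but one way to make this code more performant is to get rid of the sum by using matrix multiplication: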

sum(a.*exp(b.*c),1) --> a.'*exp(b.*c)

On my system, this increased the speedup from about 10 to about 15.
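To spell out why the rewrite gives the same result, here is a short derivation. The symbol E is introduced only for this note and does not appear in the question; it stands for the n-by-m matrix exp(b.*c).

```latex
% E is the n-by-m matrix produced by exp(b.*c) via implicit expansion
E_{ij} = \exp(b_i c_j), \qquad i = 1,\dots,n, \quad j = 1,\dots,m
% sum(a.*E, 1) sums over the row index i for each column j
\bigl[\operatorname{sum}(a \mathbin{.*} E,\, 1)\bigr]_j
  = \sum_{i=1}^{n} a_i E_{ij}
  = \bigl(a^{\mathsf{T}} E\bigr)_j
```

Each column sum is thus one entry of the 1-by-m product a.'*E, so the elementwise multiply and the reduction collapse into a single matrix-vector product.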

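I should also mention that, for the lowest values of n, I got a speedup of about 20 by also replacing the array multiplication (.*) with matrix multiplication (*): a.'*exp(b.*c) --> a.'*exp(b*c).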

Tested on R2019b, Win10, GTX660.
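Before swapping one form for another, it may be worth confirming that all three expressions agree. The following is a minimal sketch that runs on the CPU with small, arbitrarily chosen sizes (n = 2000 and m = 500 are illustration values, not taken from the question); the variable names r1, r2, r3 and the relErr values are likewise made up for this check. Since b is n-by-1 and c is 1-by-m, b*c is the outer product and holds the same values as b.*c.

```matlab
% Sanity check (CPU, small sizes): the three forms should agree to rounding error.
rng(0);                     % fixed seed so the check is repeatable
n = 2000; m = 500;          % illustration sizes only (assumption)
a = rand(n,1); b = rand(n,1); c = rand(1,m);

r1 = sum(a.*exp(b.*c),1);   % original: implicit expansion, then column sum
r2 = a.'*exp(b.*c);         % sum replaced by a (1 x n)*(n x m) product
r3 = a.'*exp(b*c);          % b*c is the n x m outer product of b and c

% Largest difference relative to the largest entry of the reference result
relErr12 = max(abs(r1 - r2))/max(abs(r1));
relErr13 = max(abs(r1 - r3))/max(abs(r1));
fprintf('max relative difference: %.2e (r2), %.2e (r3)\n', relErr12, relErr13);
```

The differences should be at rounding level, because the matrix product may add the terms in a different order than sum does. The same check can be repeated with gpuArray inputs before timing with gputimeit.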
