Matlab is slow when using a user-defined function with computations on the GPU

Time: 2014-08-24 04:50:11

Tags: performance matlab function gpu

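When I run the code shown below, the tic/toc pair inside the function reports a very short time (< 1 second) to run through all of its lines. However, it actually takes about 2.3 seconds to get the output! I am using tic/toc pairs to measure the time.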

tic

rnn.v = 11;
rnn.h = 101;
rnn.o = 7;
rnn.h_init = randn(1,rnn.h,'gpuArray');
rnn.W_vh = randn(rnn.v,rnn.h,'gpuArray');
rnn.W_hh = randn(rnn.h,rnn.h,'gpuArray');
rnn.W_ho = randn(rnn.h,rnn.o,'gpuArray');

inData.V = randn(10000,11,100,'gpuArray');
inData.TimeSteps =100;
inData.BatchSize = 10000;

[H,OX] = forward_pass(rnn, inData)
toc

All matrices in rnn and inData are gpuArrays, so all computations are performed on the GPU. The outputs are gpuArrays as well.

function [H,OX] = forward_pass(rnn, inData)
        tic;
        %initial hidden state values
        H_init = gpuArray(repmat(rnn.h_init,[inData.BatchSize,1]));

        %initialize state H
        H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');

        %initialize OX (which is H * Who)
        OX = zeros(inData.BatchSize, rnn.o, inData.TimeSteps,'gpuArray');

        for t = 1 : inData.TimeSteps

            if t == 1
                HX_t = H_init * rnn.W_hh... 
                        + inData.V(:,:,t) * rnn.W_vh;
            else
                HX_t = H(:,:,(t-1)) * rnn.W_hh... 
                        + inData.V(:,:,t) * rnn.W_vh;
            end

            H(:,:,t) = tanh(HX_t);
            OX(:,:,t) = H(:,:,t) * rnn.W_ho;


        end

        toc;
    end

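Normally it is slow if you use the gather() function. I am not using gather() to transfer the outputs to the workspace, so I do not know why it is still slow. It looks as if the last line, "end", takes more than 2 seconds.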

Does anyone know how to speed up the function call?

1 Answer:

Answer 0 (score: 1)

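First, to benchmark properly you need to use gather, either inside the function call or after it. In the former case the function returns non-GPU output; in the latter the output remains a GPU-based data type. GPU operations are queued asynchronously, so without a synchronizing call such as gather, toc can report a time before the computation has actually finished. Now, back to your question: you are using very few TimeSteps, so any optimization you try will not show up in a big way. Here is an optimized version that shows a larger performance gain as you increase TimeSteps:
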
function [H,OX] = forward_pass(rnn, inData)

H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');

T = reshape(permute(inData.V,[1 3 2]),[],size(inData.V,2))*rnn.W_vh;
H(:,:,1) = tanh(bsxfun(@plus,rnn.h_init * rnn.W_hh,T(1:size(inData.V,1),:)));

for t = 2 : inData.TimeSteps
    H(:,:,t) = tanh( H(:,:,(t-1))*rnn.W_hh + ...
        T((t-1)*size(inData.V,1)+1: t*size(inData.V,1),:));
end

A = reshape(permute(H,[1 3 2]),[],size(H,2))*rnn.W_ho;
OX = permute(reshape(A,size(H,1),size(A,1)/size(H,1),[]),[1 3 2]);

return;
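
The key change in the code above is that the per-time-step products with rnn.W_vh and rnn.W_ho are folded into one large matrix multiplication each by using permute and reshape. The following is only a minimal sketch to check that this reshaping trick matches the loop; the sizes B, v, h, T and the variable names are hypothetical, and plain CPU arrays are used so it runs without a GPU.

```matlab
% Small hypothetical sizes; plain CPU arrays so no GPU is needed.
B = 4; v = 3; h = 5; T = 6;   % batch, input width, hidden width, time steps
V    = randn(B, v, T);
W_vh = randn(v, h);

% Reference: one small product per time step, as in the question's loop.
ref = zeros(B, h, T);
for t = 1:T
    ref(:,:,t) = V(:,:,t) * W_vh;
end

% Batched: move time next to batch, flatten to (B*T)-by-v, multiply once.
flat = reshape(permute(V, [1 3 2]), [], v) * W_vh;   % (B*T)-by-h

% Rows (t-1)*B+1 : t*B of flat hold time step t; fold back to B-by-h-by-T.
batched = permute(reshape(flat, B, T, h), [1 3 2]);

maxErr = max(abs(ref(:) - batched(:)));
fprintf('max abs difference: %g\n', maxErr);
```

Only the recurrent term H(:,:,t-1)*rnn.W_hh has to stay inside the loop, because each step depends on the previous one.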

Benchmarks

Test case #1

Parameters

rnn.v = 11;
rnn.h = 5;
rnn.o = 7;
inData.TimeSteps = 10000;
inData.BatchSize = 10;

Results

---- Original Code :
Elapsed time is 5.678876 seconds.
---- Modified Code :
Elapsed time is 3.821059 seconds.

Test case #2

Parameters

inData.TimeSteps = 50000; (rest are same as in Test Case #1)

Results

---- Original Code :
Elapsed time is 28.392290 seconds.
---- Modified Code :
Elapsed time is 19.031776 seconds.

Note that these were tested on a GTX 750 Ti.
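
For anyone reproducing such timings, the sketch below shows one way to time GPU work so that asynchronous execution does not distort the result. It assumes the Parallel Computing Toolbox and a supported GPU; the matrices A and B and the tanh(A * B) workload are hypothetical stand-ins, not the code from the question. It uses tic with an output handle, so a bare tic called elsewhere (for example inside the function being timed) cannot reset the timer being read.

```matlab
% Assumes Parallel Computing Toolbox and a supported GPU.
% A, B and the workload are hypothetical stand-ins for the real computation.
A = randn(2000, 2000, 'gpuArray');
B = randn(2000, 2000, 'gpuArray');
dev = gpuDevice;

% Naive timing: toc may fire before the queued GPU work has finished.
t0 = tic;
C = tanh(A * B);
tNaive = toc(t0);

% Synchronized timing: block until the device is idle before reading the clock.
wait(dev);
t0 = tic;
C = tanh(A * B);
wait(dev);
tSync = toc(t0);

% gputimeit runs the function several times and handles synchronization itself.
tBench = gputimeit(@() tanh(A * B));

fprintf('naive %.4f s, synchronized %.4f s, gputimeit %.4f s\n', ...
        tNaive, tSync, tBench);
```

For a function with two outputs such as forward_pass, the same idea would be gputimeit(@() forward_pass(rnn, inData), 2), with the tic/toc pair inside the function removed first.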