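When I run the code shown below, the tic/toc pair inside the function reports a very short time (<1 second) to run through all of its lines. However, it actually takes about 2.3 seconds to get the output! I am using tic/toc pairs to measure the time.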
tic
rnn.v = 11;
rnn.h = 101;
rnn.o = 7;
rnn.h_init = randn(1,rnn.h,'gpuArray');
rnn.W_vh = randn(rnn.v,rnn.h,'gpuArray');
rnn.W_hh = randn(rnn.h,rnn.h,'gpuArray');
rnn.W_ho = randn(rnn.h,rnn.o,'gpuArray');
inData.V = randn(10000,11,100,'gpuArray');
inData.TimeSteps =100;
inData.BatchSize = 10000;
[H,OX] = forward_pass(rnn, inData)
toc
All matrices in rnn and inData are gpuArrays, so all computation is performed on the GPU. The outputs are gpuArrays as well.
function [H,OX] = forward_pass(rnn, inData)
    tic;
    %initial hidden state values
    H_init = gpuArray(repmat(rnn.h_init,[inData.BatchSize,1]));
    %initialize state H
    H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');
    %initialize OX (which is H * Who)
    OX = zeros(inData.BatchSize, rnn.o, inData.TimeSteps,'gpuArray');
    for t = 1 : inData.TimeSteps
        if t == 1
            HX_t = H_init * rnn.W_hh...
                + inData.V(:,:,t) * rnn.W_vh;
        else
            HX_t = H(:,:,(t-1)) * rnn.W_hh...
                + inData.V(:,:,t) * rnn.W_vh;
        end
        H(:,:,t) = tanh(HX_t);
        OX(:,:,t) = H(:,:,t) * rnn.W_ho;
    end
    toc;
end
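Normally I would expect it to be slow if I used the gather() function. I am not using gather() to transfer the output to the workspace, so I don't know why it is still slow. It looks as if the last line, "end", takes more than 2 seconds.
Does anyone know how to speed up the function call?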
Answer 0 (score: 1)
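First, for proper benchmarking you need to use gather, either inside the function call or after it. In the former case the function returns non-GPU output; in the latter case the output stays a GPU-based datatype. Now, coming back to your question: you are using very few TimeSteps, so any optimization you try won't show up in a big way. Here is an optimized version whose benefit should become more visible as you increase TimeSteps: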
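function [H,OX] = forward_pass(rnn, inData)
    H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');
    % Input projection for all time steps in one matrix multiply;
    % rows (t-1)*BatchSize+1 : t*BatchSize of T belong to time step t
    T = reshape(permute(inData.V,[1 3 2]),[],size(inData.V,2))*rnn.W_vh;
    % First step: broadcast the single initial hidden row over the batch
    H(:,:,1) = tanh(bsxfun(@plus,rnn.h_init * rnn.W_hh,T(1:size(inData.V,1),:)));
    % Only the recurrent term remains inside the loop
    for t = 2 : inData.TimeSteps
        H(:,:,t) = tanh( H(:,:,(t-1))*rnn.W_hh + ...
            T((t-1)*size(inData.V,1)+1: t*size(inData.V,1),:));
    end
    % Output projection for all time steps at once, reshaped back to
    % BatchSize x rnn.o x TimeSteps
    A = reshape(permute(H,[1 3 2]),[],size(H,2))*rnn.W_ho;
    OX = permute(reshape(A,size(H,1),size(A,1)/size(H,1),[]),[1 3 2]);
return;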
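The reshape/permute step is the least obvious part of this version, so here is a minimal sketch that checks it on the CPU: the batched product should match the per-time-step product inData.V(:,:,t) * rnn.W_vh from the original loop. The sizes and the variable names B, nT, P and maxErr are arbitrary choices for illustration, not part of the code above.

```matlab
% Small CPU-only check; sizes are arbitrary and chosen for illustration.
B = 4; v = 3; h = 5; nT = 6;
V    = randn(B, v, nT);      % stands in for inData.V
W_vh = randn(v, h);          % stands in for rnn.W_vh

% Batched projection: stack all time steps into one (B*nT)-by-v matrix.
P = reshape(permute(V, [1 3 2]), [], v) * W_vh;

% Compare each block of rows with the per-step product from the original loop.
maxErr = 0;
for t = 1:nT
    rows   = (t-1)*B + 1 : t*B;
    maxErr = max(maxErr, max(max(abs(P(rows,:) - V(:,:,t) * W_vh))));
end
fprintf('max abs difference: %g\n', maxErr);
```

The difference should be at the level of floating-point rounding, which confirms that rows (t-1)*B+1 : t*B of the batched result correspond to time step t.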
Test Case #1
Parameters
rnn.v = 11;
rnn.h = 5;
rnn.o = 7;
inData.TimeSteps = 10000;
inData.BatchSize = 10;
Results
---- Original Code :
Elapsed time is 5.678876 seconds.
---- Modified Code :
Elapsed time is 3.821059 seconds.
Test Case #2
Parameters
inData.TimeSteps = 50000; (the rest are the same as in Test Case #1)
Results
---- Original Code :
Elapsed time is 28.392290 seconds.
---- Modified Code :
Elapsed time is 19.031776 seconds.
Note that these were tested on a GTX 750 Ti.
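On the benchmarking point at the start of this answer: GPU operations in MATLAB can return control before the device has finished, so a bare tic/toc around GPU code may under-report the time. Besides gather, the Parallel Computing Toolbox offers wait(gpuDevice) and gputimeit for this. The following is only a sketch under those assumptions; the matrix size and the variable names are made up for illustration, and it needs a supported GPU to run.

```matlab
% Assumes Parallel Computing Toolbox and a supported GPU.
dev = gpuDevice;

A = randn(2000, 'gpuArray');   % arbitrary size, for illustration only

tic;
B = A * A;                     % may return before the GPU has finished
tLaunch = toc;

tic;
C = A * A;
wait(dev);                     % block until all queued GPU work is done
tSync = toc;

% gputimeit takes care of synchronization and repeated runs itself.
tBench = gputimeit(@() A * A);

fprintf('no sync: %.4f s, with wait: %.4f s, gputimeit: %.4f s\n', ...
    tLaunch, tSync, tBench);
```

In the question's script, the call [H,OX] = forward_pass(rnn, inData) also has no trailing semicolon, so MATLAB prints both outputs; that display step probably accounts for part of the gap between the inner and outer timings. Ending the call with a semicolon and placing wait(gpuDevice) before each toc should bring the two measurements closer together.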