使用GPU有效地计算频率平均周期图

时间:2015-03-05 21:41:43

标签: performance matlab optimization gpu signal-processing

在Matlab中,我正在寻找一种最有效地计算GPU上频率平均周期图的方法。

我知道最重要的是最小化for循环并使用已经内置的GPU功能。然而,我的代码仍然感觉相对不优化,我想知道我可以做些什么改变以获得更好的加速。

r = 5;  % Dimension
n = 100;  % Time points
m = 20;  % Bandwidth of smoothing

% Generate some random rxn data 
X = rand(r, n);

% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);

% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));
Pdgm = zeros(r, r, n/2 + 1);
for j = 1:n/2 + 1
    Pdgm(:,:,j) = FT(:,j)*FT(:,j)';
end

% Finally smooth with our weights
SmPdgm = zeros(r, r, n/2 + 1);

% Take advantage of the GPU filter function
% Create new Periodogram WrapPdgm with m/2 values wrapped around in front and 
% behind it (it seems like there is redundancy here)
WrapPdgm = zeros(r,r,n/2 + 1 + m);
WrapPdgm(:,:,m/2+1:n/2+m/2+1) = Pdgm;
WrapPdgm(:,:,1:m/2) = flip(Pdgm(:,:,2:m/2+1),3);
WrapPdgm(:,:,n/2+m/2+2:end) = flip(Pdgm(:,:,n/2-m/2+1:end-1),3);

% Perform filtering
for i = 1:r
    for j = 1:r
        temp = filter(w, [1], WrapPdgm(i,j,:));
        SmPdgm(i,j,:) = temp(:,:,m+1:end);
    end
end

特别是,在从傅里叶变换数据计算初始Pdgm时,我看不到优化for循环的方法,我觉得我使用WrapPdgm玩的技巧是为了如果有一个平滑的函数,那么利用GPU上的filter()感觉是不必要的。

3 个答案:

答案 0 :(得分:1)

解决方案代码

这似乎非常有效,因为下一节中的基准运行时可能会说服我们 -

%// Select the portion of FT to be processed and 
%// send copy to GPU for calculating everything
gFT = gpuArray(FT(:,1:n/2 + 1));

%// Perform non-smoothed Periodogram, thus removing the first loop
Pdgm1 = bsxfun(@times,permute(gFT,[1 3 2]),permute(conj(gFT),[3 1 2]));

%// Generate WrapPdgm right on GPU
WrapPdgm1 = zeros(r,r,n/2 + 1 + m,'gpuArray');
WrapPdgm1(:,:,m/2+1:n/2+m/2+1) = Pdgm1;
WrapPdgm1(:,:,1:m/2) = Pdgm1(:,:,m/2+1:-1:2);
WrapPdgm1(:,:,n/2+m/2+2:end) = Pdgm1(:,:,end-1:-1:n/2-m/2+1);

%// Perform filtering on GPU and get the final output, SmPdgm1
filt_data = filter(w,1,reshape(WrapPdgm1,r*r,[]),[],2);
SmPdgm1 = gather(reshape(filt_data(:,m+1:end),r,r,[]));

基准

基准代码

%// Input parameters
r = 50; % Dimension
n = 1000;  % Time points
m = 200;  % Bandwidth of smoothing

% Generate some random rxn data
X = rand(r, n);

% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);

% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));

tic,    %// ... Code from original approach,   toc
tic     %// ... Code from proposed approach,   toc

GPU, GTX 750 Ti针对CPU, I-7 4790K -

获得的运行时结果
------------------------------ With Original Approach on CPU
Elapsed time is 0.279816 seconds.
------------------------------ With Proposed Approach on GPU
Elapsed time is 0.169969 seconds.

答案 1 :(得分:0)

要摆脱第一个循环,您可以执行以下操作:

Pdgm_cell = cellfun(@(x) x * x', mat2cell(FT(:, 1 : 51), [5], ones(51, 1)), 'UniformOutput', false);
Pdgm = reshape(cell2mat(Pdgm_cell),5,5,[]);

然后在您的过滤器中,您可以执行以下操作:

temp = filter(w, 1, WrapPdgm, [], 3);
SmPdgm = temp(:, :, m + 1 : end);

3让过滤器知道沿着数据的第三维操作。

答案 2 :(得分:0)

您可以在GPU上使用pagefun作为第一个循环。 (请注意,cellfun的实现基本上是一个隐藏循环,而pagefun使用批量GEMM操作在GPU上本机运行)。以下是:

n = 16;
r = 8;
X = gpuArray.rand(r, n);
R = gpuArray.zeros(r, r, n/2 + 1);
for jj = 1:(n/2+1)
    R(:,:,jj) = X(:,jj) * X(:,jj)';
end
X2 = X(:,1:(n/2+1));
R2 = pagefun(@mtimes, reshape(X2, r, 1, []), reshape(X2, 1, r, []));
R - R2