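I wrote some MATLAB code that computes the self-quotient image (SQI). Now I want to rewrite part of it in parallel to speed it up. This is the part in question: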
siz=15;
X=normalize8(X);
[a,b]=size(X);
filt = fspecial('gaussian',[siz siz],sigma);
padsize = floor(siz/2);
padX = padarray(X,[padsize, padsize],'symmetric','both');
t0 = tic; % -------------------------------------------------------------
Z=zeros(a,b);
for i=padsize+1:a+padsize
    for j=padsize+1:b+padsize
        region = padX(i-padsize:i+padsize, j-padsize:j+padsize);
        means= mean(region(:));
        M=return_step(region, means);
        filt1=filt.*M;
        summ=sum(sum(filt1));
        filt1=(filt1/summ);
        Z(i-padsize,j-padsize)=(sum(sum(filt1.*region))/(siz*siz));
    end
end
toc(t0) % -------------------------------------------------------------
and the return_step function:
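function M=return_step(X, means)
% Binary mask: 1 where X >= means, 0 elsewhere.
[a,b]=size(X);
M=zeros(a,b); % preallocate so M always has the same size as X
for i=1:a
    for j=1:b
        if X(i,j)>=means
            M(i,j)=1;
        end
    end
end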
I wrote the following kernel function:
__global__ void returnstep(const double* x, double* m, double* filt, int leng, double mean, int i, int j, int width)
{
    int idx=threadIdx.y*blockDim.x+threadIdx.x;
    if(idx>=leng) return;
    int ridx= (j+threadIdx.y)*width+threadIdx.x+i;
    double xval= x[ridx];
    if (xval>=mean) m[idx]=filt[idx]*xval;
    else m[idx]=0;
}
and then changed the MATLAB code as follows:
kernel= parallel.gpu.CUDAKernel('returnstep.ptx', 'returnstep.cu');
kernel.ThreadBlockSize= [double(siz) double(siz) 1];
GM = gpuArray(zeros(siz,siz));
GpadX = gpuArray(padX);
Gfilt = gpuArray(filt);
%% Process image
t0 = tic; % -------------------------------------------------------------
Z=zeros(a,b);
for i=padsize+1:a+padsize
    for j=padsize+1:b+padsize
        means= mean(region(:));
        GM= feval(kernel, GpadX, GM, Gfilt, siz*siz, means, i-padsize-1, j-padsize-1, padXwidth);
        filt1= gather(GM);
        summ=sum(sum(filt1));
        filt1=(filt1/summ);
        Z(i-padsize,j-padsize)=(sum(sum(filt1))/(siz*siz));
    end
end
toc(t0) % -------------------------------------------------------------
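My sequential code processes a 330x200 image in 2.5 seconds, but the new parallel code takes 15 seconds. I don't understand why. I need some advice on how to improve it. I am new to CUDA programming.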
Answer 0: (score: 1)
> help gather
...
X = GATHER(A) when A is a GPUArray, X is an array in the local workspace
with the data transferred from the GPU device.
....
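filt1 = gather(GM) copies GM from the GPU to the CPU at every step, which is very inefficient: for a 330x200 image that is 66,000 kernel launches, each followed by a device-to-host transfer of a 15x15 array. You should move the entire computation in the loop body to the GPU or, better, move the entire loop nest into the GPU kernel. Otherwise you can forget about any speedup.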
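A minimal sketch of the second suggestion, with one thread computing one output pixel so that the whole loop nest becomes a single kernel launch. It assumes column-major storage as in MATLAB, an odd window size siz, and a padded input of (a + siz - 1) x (b + siz - 1) elements. The names sqi_smooth and run_sqi_smooth are hypothetical, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per output pixel, column-major arrays.
// padX is (a + siz - 1) x (b + siz - 1), filt is siz x siz, Z is a x b.
__global__ void sqi_smooth(const double* padX, const double* filt, double* Z,
                           int a, int b, int siz)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;  // output row (0-based)
    int c = blockIdx.y * blockDim.y + threadIdx.y;  // output column (0-based)
    if (r >= a || c >= b) return;
    int ld = a + siz - 1;                           // number of rows of padX

    // Pass 1: mean of the siz x siz window whose top-left corner is (r, c).
    double mean = 0.0;
    for (int j = 0; j < siz; ++j)
        for (int i = 0; i < siz; ++i)
            mean += padX[(c + j) * ld + (r + i)];
    mean /= (double)(siz * siz);

    // Pass 2: keep only the weights where the sample is >= mean, then
    // normalise by their sum (same arithmetic as the MATLAB loop body).
    double wsum = 0.0, acc = 0.0;
    for (int j = 0; j < siz; ++j)
        for (int i = 0; i < siz; ++i) {
            double x = padX[(c + j) * ld + (r + i)];
            if (x >= mean) {
                double w = filt[j * siz + i];
                wsum += w;
                acc  += w * x;
            }
        }
    Z[c * a + r] = acc / wsum / (double)(siz * siz);
}

// Hypothetical host wrapper: copy the inputs once, launch once, copy Z back once.
void run_sqi_smooth(const std::vector<double>& padX, const std::vector<double>& filt,
                    std::vector<double>& Z, int a, int b, int siz)
{
    double *dX, *dF, *dZ;
    cudaMalloc(&dX, padX.size() * sizeof(double));
    cudaMalloc(&dF, filt.size() * sizeof(double));
    cudaMalloc(&dZ, Z.size() * sizeof(double));
    cudaMemcpy(dX, padX.data(), padX.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dF, filt.data(), filt.size() * sizeof(double), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((a + block.x - 1) / block.x, (b + block.y - 1) / block.y);
    sqi_smooth<<<grid, block>>>(dX, dF, dZ, a, b, siz);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(Z.data(), dZ, Z.size() * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dF);
    cudaFree(dZ);
}
```

From MATLAB, the kernel alone could presumably be compiled to PTX, loaded with parallel.gpu.CUDAKernel, and invoked with a single feval after setting ThreadBlockSize and GridSize, followed by one gather for the whole of Z. Because the summation order differs from MATLAB's, results may differ in the last few bits.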
Answer 1: (score: 0)
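My evaluation with a Sobel filter showed that the CPU outperforms the GPU on small images. I think your image is too small for a meaningful CPU-GPU performance comparison. The computation should be large enough to hide the kernel launch and communication overhead.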
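One rough way to judge whether the work per launch is large enough is to time a kernel that does nothing. The sketch below estimates the average cost of an empty launch using CUDA events; the helper name avg_launch_ms is hypothetical, and the figure it returns depends on the GPU and driver.

```cuda
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

// Hypothetical helper: average time in milliseconds of one empty kernel launch.
float avg_launch_ms(int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    empty_kernel<<<1, 1>>>();   // warm-up launch, keeps one-time setup out of the timing
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int k = 0; k < n; ++k)
        empty_kernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop); // wait until all n launches have completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / n;
}
```

Multiplying the result by the number of launches gives a lower bound on the overhead of a launch-per-pixel design. It is only a lower bound, since a loop that also copies a result back after every launch pays a transfer and a synchronisation each time.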