I received a machine fitted with a Tesla K80 setup exposing 4 GPUs, and I am trying to use a parfor loop from the Matlab Parallel Computing Toolbox (PCT) to speed up FFT computations, yet it runs slower than expected.
Here is my attempt:
% pupil is a 512x512 array
parfor zz = 1:4
    gd = gpuDevice;              % GPU currently assigned to this worker
    d{zz} = gd.Index;            % record which device each worker got
    probe{zz} = gpuArray(pupil); % copy the input onto that worker's GPU
    Essai{zz} = gpuArray(pupil);
end
tic;
parfor ii = 1:4
    gd2 = gpuDevice;
    d2{ii} = gd2.Index;
    for i = 1:100
        Essai{ii} = fftn(probe{ii}); % 100 FFTs per worker
    end
end
toc
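For comparison, a variant with explicit device selection can rule out two workers landing on the same GPU. This is only a sketch (untested here), assuming a pool with one worker per GPU and the same 512x512 `pupil` array as above:

spmd
    % Pin each worker to a distinct device instead of relying on the
    % default worker-to-GPU assignment.
    gpuDevice(mod(labindex-1, gpuDeviceCount) + 1);
    probeLocal = gpuArray(pupil);      % data lives on this worker's GPU
    for k = 1:100
        essaiLocal = fftn(probeLocal);
    end
    wait(gpuDevice);                   % block until the queued FFTs finish
end

The `wait` call matters for timing: GPU operations are launched asynchronously, so without it, `toc` can return before the FFTs have actually completed.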
%%
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.
Elapsed time is 1.805763 seconds.
Elapsed time is 1.412928 seconds.
Elapsed time is 1.409559 seconds.
Starting parallel pool (parpool) using the 'local' profile ... connected to 8 workers.
Elapsed time is 0.606602 seconds.
Elapsed time is 0.297850 seconds.
Elapsed time is 0.294365 seconds.
%%
tic; for i = 1:400; Essai{1} = fftn( probe{1} ); end; toc
Elapsed time is 0.193579 seconds !!!
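One caveat about that single-GPU number: `fftn` on a `gpuArray` is asynchronous, so `tic`/`toc` without synchronization mostly measures kernel launch time, not execution time. A fairer single-GPU timing would be a sketch like this (assuming `probe{1}` already resides on device 1, as in the code above):

gd = gpuDevice(1);
tic;
for i = 1:400
    Essai{1} = fftn(probe{1});
end
wait(gd);   % force completion of all queued GPU work before stopping the clock
toc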
Why is it faster with 8 workers open, when I only stored variables onto the 4 GPUs (rather than 8)?
Also, how can I use the Tesla K80 as a single GPU?
Merci, Nicolas

Answer 0 (score: 1)
I doubt that parfor is well suited to multi-GPU systems. If speed is critical and you want to get the most out of the GPUs, I would suggest writing your own small CUDA routine using the cuFFT library: http://docs.nvidia.com/cuda/cufft/#multiple-GPU-cufft-transforms
Here is how to write a mex file containing CUDA code: http://www.mathworks.com/help/distcomp/run-mex-functions-containing-cuda-code.html
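From the MATLAB side, hooking in such a CUDA mex file is short. A sketch, where the file name `fft2d_cufft.cu` and the function it defines are hypothetical placeholders for whatever you write following the link above:

% Compile a CUDA source file that calls cuFFT into a mex function,
% linking against the cuFFT library (requires mexcuda, R2015b or later).
mexcuda fft2d_cufft.cu -lcufft

% Then call it like any other MATLAB function on GPU data.
result = fft2d_cufft(gpuArray(single(pupil)));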
Answer 1 (score: 0)

Thank you very much for the quick reply and the links! Indeed, I was trying to avoid CUDA, but it does seem to be the best option for scaling FFTs, even though I had assumed parfor and spmd were good tools for multi-GPU work.