Question

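I have a computer with 4 Tesla K80 GPUs, and I am trying to use parfor loops from the MATLAB Parallel Computing Toolbox (PCT) to speed up FFT calculations, but it is still slower.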

Here is what I tried:

    % pupil is based on a 512x512 array
    parfor zz = 1:4
        gd = gpuDevice;
        d{zz} = gd.Index;
        probe{zz} = gpuArray(pupil); 
        Essai{zz} = gpuArray(pupil); 
    end

    tic;
    parfor ii = 1:4
        gd2 = gpuDevice;
        d2{ii} = gd2.Index;
        for i = 1:100
            Essai{ii} = fftn(probe{ii});
        end
    end
    toc
    %%

Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.
Elapsed time is 1.805763 seconds.
Elapsed time is 1.412928 seconds.
Elapsed time is 1.409559 seconds.

Starting parallel pool (parpool) using the 'local' profile ... connected to 8 workers.
Elapsed time is 0.606602 seconds.
Elapsed time is 0.297850 seconds.
Elapsed time is 0.294365 seconds.

    %%
    tic; for i = 1:400; Essai{1} = fftn( probe{1} ); end; toc

Elapsed time is 0.193579 seconds !!!

Why is opening 8 workers faster when, in principle, I am only storing my variables on 4 GPUs (out of 8)?

Also, how can I use a Tesla K80 as a single GPU?

Merci, Nicolas

Answer 1

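I doubt that parfor is well suited to multi-GPU systems. If speed is critical and you want to get the most out of your GPUs, I suggest writing your own small CUDA program using the cuFFT library: http://docs.nvidia.com/cuda/cufft/#multiple-GPU-cufft-transforms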

Here is how to write a MEX file containing CUDA code: http://www.mathworks.com/help/distcomp/run-mex-functions-containing-cuda-code.html

Answer 2

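Thank you very much for the quick reply and the links! Indeed, I was trying to avoid CUDA, but it seems to be the best option for scaling up FFTs. I did think parfor and spmd were good tools for multi-GPU work, though.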
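For anyone who wants to try the spmd route before moving to CUDA, here is a minimal sketch, not a tested benchmark. It assumes the Parallel Computing Toolbox, at least one supported GPU, and a release where `labindex` is available (newer releases call it `spmdIndex`). The random `pupil` array and the variable names `nGpus`, `nReps`, and `usedIndex` are placeholders of mine. The idea is to open one worker per GPU, bind each worker to its own device, create the data once on that device, and call `wait` before stopping the timer, since GPU calls can return before the work has finished.

    % Sketch only: one worker per GPU, data created once and left on the device.
    pupil = rand(512, 512, 'single');   % placeholder for the real 512x512 pupil
    nGpus = gpuDeviceCount;             % GPUs visible to MATLAB (a K80 board shows up as two)
    nReps = 100;                        % FFTs per worker, as in the question

    pool = gcp('nocreate');
    if isempty(pool) || pool.NumWorkers ~= nGpus
        delete(pool);                   % no-op when there is no pool
        pool = parpool(nGpus);          % one worker per GPU
    end

    spmd
        gpuDevice(labindex);            % bind worker k to GPU k (this resets that device)
        probe = gpuArray(pupil);        % stays on this worker between spmd blocks
        wait(gpuDevice);                % make sure the transfer has finished
    end

    tic;
    spmd
        for k = 1:nReps
            Essai = fftn(probe);
        end
        wait(gpuDevice);                % finish all queued GPU work before timing stops
    end
    toc

    spmd
        usedIndex = gpuDevice().Index;  % report which device each worker used
    end
    for k = 1:nGpus
        fprintf('worker %d -> GPU %d\n', k, usedIndex{k});
    end

Because `probe` and `Essai` are created inside spmd, they remain on the workers between blocks, so the timed section should not include copying the arrays between the client and the workers. Whether this beats a plain for loop on one GPU for 512x512 transforms is something you would have to measure on your own machine.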

Multiple Tesla K80 GPUs and parfor loops
