Speeding up constrained shuffling: GPU (Tesla K40m) and CPU parallel computing in MATLAB

Time: 2017-11-30 09:36:15

Tags: matlab performance parallel-processing shuffle gpu-programming

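I have 100 lamps, and they blink. I observe them for some time. For each lamp I calculate the mean, standard deviation, and autocorrelation of the intervals between blinks. Now I need to resample the observed data and keep those permutations for which all parameters (mean, std, autocorrelation) fall within a given range. The code I have works well, but each round of the experiment takes too long (a week). I run it on a compute server with 12 cores and 2 Tesla K40m GPUs (details at the end).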

My code:

close all
clear all
clc
% Open a parpool; skip the error if one is already open
try parpool(24); end

% Sample input. It is faked, just for demo.
% Number of "lamps" and number of "blinks" are similar to real.
NLamps = 10^2;
NBlinks = 2*10^2;
Events = cumsum([randg(9,NLamps,NBlinks)],2); % each row - different "lamp"
DurationOfExperiment=Events(:,end).*1.01;

%% MAIN
% Define parameters
nLags=2; % I need to keep autocorrelation with lags 1-2
alpha=[0.01,0.1]; % range of allowed relative deviation from observed 
                  % parameters should be > 0 to avoid generating original
                  % sequence
nPermutations=10^2; % In original code 10^5                  

% Processing of experimental data                  
DurationOfExperiment=num2cell(DurationOfExperiment);
Events=num2cell(Events,2);
Intervals=cellfun(@(x) diff(x),Events,'UniformOutput',false);
observedParams=cellfun(@(x) fGetParameters(x,nLags),Intervals,'UniformOutput',false);
observedParams=cell2mat(observedParams);

% Constrained shuffling. EXPENSIVE PART!!!
while true
    parfor iPermutation=1:nPermutations
        % Shuffle intervals
        shuffledIntervals=cellfun(@(x,y) fPermute(x,y),Intervals,DurationOfExperiment,'UniformOutput',false); 
        % get parameters of shuffled intervals
        shuffledParameters=cellfun(@(x) fGetParameters(x,nLags),shuffledIntervals,'UniformOutput',false);
        shuffledParameters=cell2mat(shuffledParameters);
        % get relative deviation
        delta=abs((shuffledParameters-observedParams)./observedParams);
        % find shuffled Lamps, which are inside alpha range
        MaximumDeviation=max(delta,[] ,2);
        MinimumDeviation=min(delta,[] ,2);
        LampID=find(and(MaximumDeviation<alpha(2),MinimumDeviation>alpha(1)));
        % if shuffling of ANY lamp was successful, save these Intervals
        if ~isempty(LampID)
            shuffledIntervals=shuffledIntervals(LampID);
            shuffledParameters=shuffledParameters(LampID,:);
            parsave( LampID,shuffledIntervals,shuffledParameters);
            'DONE'
        end
    end
end



%% FUNCTIONS
function [ params ] = fGetParameters( intervals,nLags )
% Calculate [mean, std, autocorrelations with lags from 1 to nLags]
    R=nan(1,nLags);
    for lag=1:nLags
            R(lag) = corr(intervals(1:end-lag)',intervals((1+lag):end)','type','Spearman');
    end
    params = [mean(intervals),std(intervals),R];
end
%--------------------------------------------------------------------------
function [ Intervals ] = fPermute( Intervals,Duration )
    % Create long shuffled time-series
    Time=cumsum([0,datasample(Intervals,numel(Intervals)*3)]);
    % Keep the same duration
    Time(Time>Duration)=[];
    % Calculate Intervals
    Intervals=diff(Time);
end
%--------------------------------------------------------------------------
function parsave( LampID,Intervals,params)
    save([num2str(randi(10^9)),'.mat'],'LampID','Intervals','params')
end

Server specs:

>>gpuDevice() 
CUDADevice with properties:

                      Name: 'Tesla K40m'
                     Index: 1
         ComputeCapability: '3.5'
            SupportsDouble: 1
             DriverVersion: 8
            ToolkitVersion: 8
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 1.1979e+10
           AvailableMemory: 1.1846e+10
       MultiprocessorCount: 15
              ClockRateKHz: 745000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1
>> feature('numcores')
MATLAB detected: 12 physical cores.
MATLAB detected: 24 logical cores.
MATLAB was assigned: 24 logical cores by the OS.
MATLAB is using: 12 logical cores.
MATLAB is not using all logical cores because hyper-threading is enabled.

>> system('for /f "tokens=2 delims==" %A in (''wmic cpu get name /value'') do @(echo %A)')
Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz  
Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz  

>> memory
Maximum possible array:               496890 MB (5.210e+11 bytes) *
Memory available for all arrays:      496890 MB (5.210e+11 bytes) *
Memory used by MATLAB:                 18534 MB (1.943e+10 bytes)
Physical Memory (RAM):                262109 MB (2.748e+11 bytes)

*  Limited by System Memory (physical + swap file) available.

Question:

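Is it possible to speed up my calculation? I have been thinking about combined CPU + GPU computing, but I could not work out how to do it (I have no experience with gpuArrays). Moreover, I am not sure it is a good idea. Sometimes an algorithmic optimization pays off more than parallel computing does.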

P.S. The saving step is not the bottleneck: in the best case it happens once every 10-30 minutes.

1 Answer:

Answer 0: (score: 1)

GPU-based processing is only available for certain functions, and only with the right card (if I remember correctly).
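One way to check which functions have a gpuArray implementation on a given installation is to query the class's method list. The snippet below is a minimal sketch, assuming the Parallel Computing Toolbox is installed; the names in `needed` are simply the functions the question's code relies on, and the result will depend on the MATLAB release.

```matlab
% Sketch: list which of the functions used in the question have a
% gpuArray overload on this installation (result depends on the release).
gpuMethods = methods('gpuArray');            % all methods of the gpuArray class
needed     = {'corr', 'cumsum', 'diff', 'mean', 'std'};
onGpu      = ismember(needed, gpuMethods);   % logical mask, one entry per name

for k = 1:numel(needed)
    fprintf('%-8s gpuArray support: %d\n', needed{k}, onGpu(k));
end
```

A function that is missing from this list would force a transfer back to host memory in every iteration, which tends to cancel any gain from the GPU.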

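For the GPU part of your question, MATLAB has a list of available functions that you can run on a GPU. Unfortunately, the most expensive part of your code, the function corr, is not on that list.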
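Since the Spearman correlation is the costly call in fGetParameters, one algorithmic option worth profiling is to compute it by hand: Spearman's rho equals the Pearson correlation of the tied ranks. The following is only a sketch under that assumption, not the method used in the question or the answer; `intervals` here is stand-in random data, and tiedrank comes from the Statistics and Machine Learning Toolbox that the original code already depends on.

```matlab
% Sketch: lagged Spearman autocorrelation as Pearson correlation of ranks.
rng(1);
intervals = rand(1, 200);        % stand-in for one lamp's intervals
nLags     = 2;
R         = nan(1, nLags);

for lag = 1:nLags
    x = tiedrank(intervals(1:end-lag));
    y = tiedrank(intervals(1+lag:end));
    x = x(:) - mean(x);          % centre the ranks, force column shape
    y = y(:) - mean(y);
    R(lag) = (x' * y) / sqrt((x' * x) * (y' * y));
end
disp(R)
```

Whether this is actually faster than calling corr depends on the MATLAB version and the vector length, so it should be timed against the original before being adopted.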

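If the profiler is not highlighting any bottlenecks, something strange is going on... so I ran some tests on your code above:

nPermutations = 10^0 iteration takes ~0.13 seconds
nPermutations = 10^1 iteration takes ~1.3 seconds
nPermutations = 10^3 iteration takes ~130 seconds
nPermutations = 10^4 probably takes ~1300 seconds
nPermutations = 10^5 probably takes ~13000 seconds

That is a lot less than a week...

Did I mention that I added a break to your while statement? I could not see anywhere in your code where you ever break out of the while loop. I hope, for your sake, that this is not the reason your function runs forever...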
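The two points above (timing a round and giving the outer loop an exit) can be combined in a small harness. This is a minimal sketch, not the answerer's actual test code: `maxRounds` is a hypothetical cap, and the sort call is a stand-in for the shuffle and parameter computation.

```matlab
% Sketch: bounded outer loop with per-round timing.
maxRounds     = 3;                       % hypothetical cap on outer rounds
nPermutations = 10;
roundTimes    = zeros(1, maxRounds);
iRound        = 0;

while true
    iRound = iRound + 1;
    tRound = tic;
    for iPermutation = 1:nPermutations   % parfor in the original code
        dummy = sort(rand(1, 200));      %#ok<NASGU> stand-in workload
    end
    roundTimes(iRound) = toc(tRound);
    if iRound >= maxRounds               % explicit exit condition
        break
    end
end

fprintf('Mean time per permutation: %.3g s\n', ...
        mean(roundTimes) / nPermutations);
```

Multiplying the measured time per permutation by the intended nPermutations gives a rough estimate of how long one full round should take, which makes an unexpectedly long run easier to diagnose.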