The idea behind vectorization and optimization

Question

Suppose I have the following 9 x 5 matrix:

myArray = [
   54.7    8.1   81.7   55.0   22.5
   29.6   92.9   79.4   62.2   17.0
   74.4   77.5   64.4   58.7   22.7
   18.8   48.6   37.8   20.7   43.5
   68.6   43.5   81.1   30.1   31.1
   18.3   44.6   53.2   47.0   92.3
   36.8   30.6   35.0   23.0   43.0
   62.5   50.8   93.9   84.4   18.4
   78.0   51.0   87.5   19.4   90.4
];

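I have 11 "subsets" of this matrix, and I need to run a function (let's say max) on each of them. The subsets can be identified with the following logical matrix (the subsets run down the columns, not across the rows):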

myLogicals = logical([
    0 1 0 1 1
    1 1 0 1 1
    1 1 0 0 0
    0 1 0 1 1
    1 0 1 1 1
    1 1 1 1 0
    0 1 1 0 1
    1 1 0 0 1
    1 1 0 0 1
]);

or by linear indices:

starts = [2 5 8 10 15 23 28 31 37 40 43]; % index of the start of each subset
ends =   [3 6 9 13 18 25 29 33 38 41 45]; % index of the end of each subset

so that the first subset is 2:3, the second is 5:6, and so on.
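
The logical matrix and the starts/ends vectors carry the same information. As a minimal sketch (not part of the original question), the linear start and end indices can be recovered from a logical matrix by padding each column with false and looking for rising and falling edges. The toy matrix L and the helper names padded, d, r, c, r2, c2 are assumptions made for illustration.

```matlab
% Toy logical matrix (hypothetical); runs are identified column by column.
L = logical([0 1
             1 1
             1 0]);

% Pad every column with false on top and bottom so that runs touching the
% matrix border still produce an edge.
padded = [false(1, size(L, 2)); L; false(1, size(L, 2))];
d = diff(double(padded), 1, 1);          % +1 at a run start, -1 just after a run end

[r,  c ] = find(d == 1);                 % row r of d is row r of L
[r2, c2] = find(d == -1);                % row r2 of d is one past the last true row

starts = sub2ind(size(L), r,      c ).'; % linear index where each run begins
ends   = sub2ind(size(L), r2 - 1, c2).'; % linear index where each run ends
% Here starts is [2 4] and ends is [3 5].
```

Because find returns its results in column-major order, the runs come out sorted by linear index, which matches the ordering used in the question.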

I can find the max of each subset and store it in a vector as follows:

finalAnswers = NaN(11,1); 
for n=1:length(starts) % i.e. 1 through the number of subsets
    finalAnswers(n) = max(myArray(starts(n):ends(n)));
end

After the loop runs, finalAnswers contains the maximum value of each subset:

74.4  68.6  78.0  92.9  51.0  81.1  62.2  47.0  22.5  43.5  90.4

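Is it possible to obtain the same result without using a for loop? In other words, can this code be vectorized? Would such an approach be more efficient than the current one?

EDIT: I ran some tests on the proposed solutions. The data I used was a 1,510 x 2,185 matrix with 10,103 subsets whose lengths varied from 2 to 916, with a standard deviation of subset length of 101.92.

I wrapped each solution in tic;for k=1:1000 [code here] end; toc; and these are the results: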

for-loop approach --- Elapsed time is 16.237400 seconds.
Shai's approach --- Elapsed time is 153.707076 seconds.
Dan's approach --- Elapsed time is 44.774121 seconds.
Divakar's approach #2 --- Elapsed time is 127.621515 seconds.

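Notes:

I also tried benchmarking Dan's approach by wrapping the k=1:1000 for loop around just the accumarray line (since the rest could theoretically be run only once). In that case the time was 28.29 seconds.
Benchmarking Shai's approach with the lb = ... line left outside the k loop, the time was 113.48 seconds.
When I ran Divakar's code, I got bsxfun errors ("Non-singleton dimensions of the two input arrays must match each other.") on the lines that call bsxfun. I "fixed" this by applying the conjugate transpose (the apostrophe operator ') to trade_starts(1:starts_extent) and intv(1:starts_extent) in those lines. I'm not sure why this error was occurring...

I'm not sure whether my benchmarking setup is correct, but it appears that the for loop actually runs fastest in this case.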

Answer 1

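One approach is to use accumarray. Unfortunately, in order to do that we first need to "label" your logical matrix. Here is a convoluted way of doing that if you don't have the Image Processing Toolbox: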

sz=size(myLogicals);
s_ind(sz(1),sz(2))=0;
%// OR: s_ind = zeros(size(myLogicals))

s_ind(starts) = 1;
labelled = cumsum(s_ind(:)).*myLogicals(:);

This is what Shai's bwlabeln implementation does (except that the result here has shape numel(myLogicals)-by-1 rather than size(myLogicals)).
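
To see why the cumsum trick produces labels, here is a small sketch on a toy vector. The data and the names mask and labels are made up for illustration; only the technique is taken from the answer.

```matlab
% Toy example (hypothetical data): label runs of true values in a logical vector.
mask   = logical([0 1 1 0 1 1 1 0 1]);   % three runs: 2:3, 5:7 and 9:9
starts = [2 5 9];                        % linear index where each run begins

s_ind = zeros(size(mask));               % marker vector, same size as mask
s_ind(starts) = 1;                       % put a 1 at every run start
labels = cumsum(s_ind) .* mask;          % running count of starts, zeroed outside runs
% labels is [0 1 1 0 2 2 2 0 3]
```

The cumulative sum increases by one at each start, so every element from one start up to the next carries the same count; multiplying by the mask resets the gaps between runs to 0.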

Now you can use accumarray:

accumarray(labelled(myLogicals), myArray(myLogicals), [], @max)

Or else try the following, which may be faster:

result = accumarray(labelled+1, myArray(:), [], @max);
result = result(2:end)

This is fully vectorized, but is it worth it? You will have to run speed tests against your loop solution to know.
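
The two accumarray variants above differ only in how they treat unlabelled elements: the first filters them out before the call, the second shifts every label by one so that label 0 lands in a throwaway first bin. A minimal sketch on made-up data (vals, labels and keep are illustrative names) suggests they agree:

```matlab
% Hypothetical values and run labels (0 means "not in any subset").
vals   = [5 7 3 9 1 4 8 2 6].';
labels = [0 1 1 0 2 2 2 0 3].';

% Variant 1: drop unlabelled elements before accumulating.
keep = labels > 0;
r1 = accumarray(labels(keep), vals(keep), [], @max);

% Variant 2: shift labels so that 0 becomes bin 1, then discard that bin.
r2 = accumarray(labels + 1, vals, [], @max);
r2 = r2(2:end);
% Both r1 and r2 are [7; 8; 6].
```

The shift is needed because accumarray only accepts positive integer subscripts. Whether skipping the logical indexing actually saves time depends on how much of the array is unlabelled, so it is worth timing both on real data.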

Answer 2

Use bwlabeln with vertical connectivity:

lb = bwlabeln( myLogicals, [0 1 0; 0 1 0; 0 1 0] );

Now you have a label 1..11 for each region.

To get the maximum value, you can use regionprops:

props = regionprops( lb, myArray, 'MaxIntensity' );
finalAnswers = [props.MaxIntensity];

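You can use regionprops to get some other properties of each subset, but it is not very general. If you wish to apply a more general function to each region, e.g. median, you can use accumarray: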

finalAnswer = accumarray( lb( myLogicals ), myArray( myLogicals ), [], @median );
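
As a toolbox-free illustration of the same idea, the sketch below applies two different functions per label with accumarray. The label vector is written out by hand here (an assumption standing in for the output of bwlabeln), and the data are made up.

```matlab
% Hypothetical values with hand-written run labels (0 means "no region").
vals   = [5 7 3 9 1 4 8 2 6].';
labels = [0 1 1 0 2 2 2 0 3].';
keep   = labels > 0;

% Any function that maps a vector to a scalar can be passed as the fourth argument.
med   = accumarray(labels(keep), vals(keep), [], @median);              % [5; 4; 6]
spread = accumarray(labels(keep), vals(keep), [], @(v) max(v) - min(v)); % [4; 7; 0]
```

Note that accumarray does not promise to hand the values of a group to the function in their original order unless the subscripts are sorted, so order-independent functions such as median or max are the safe choice.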

Answer 3

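The idea behind vectorization and optimization

One of the approaches that can be used to vectorize this problem is to convert the subsets into regularly shaped blocks and then find the max of the elements of those blocks in one go. Converting to regularly shaped blocks has one issue here, namely that the subsets are unequal in length. To get around this, one can create a 2D matrix of indices that starts from each of the starts elements and extends up to the maximum of the subset lengths. The good thing about this is that it allows vectorization, but at the cost of higher memory requirements, which depend on how scattered the subset lengths are.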

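Another issue with this vectorization technique is that it can lead to out-of-bounds indices being created for the final subsets. To avoid this, one can think of two possible ways:

Use a bigger input array, by extending the input array such that the maximum subset length plus the start indices still lies within the confines of the extended array.
Use the original input array for the starts for as long as we stay within the limits of the original input array, and then use the original loop code for the rest of the subsets. We can call this hybrid programming, just for the sake of having a short title. This saves the memory needed to create the extended array discussed in the other approach above.

These two ways/approaches are listed next.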
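
Before the full listings, the padded index matrix and the out-of-bounds problem can be seen on a toy vector. This is a minimal sketch with made-up data; the names x, offs, xExt and blocks are illustrative, and plain max is used instead of nanmax (MATLAB's max skips NaN values by default).

```matlab
x      = [4 9 2 7 5 1 8 3];              % hypothetical data, 8 elements
starts = [1 4 7];
ends   = [2 6 8];

intv = ends - starts;                    % [1 2 1], interval of each subset
offs = (0:max(intv)).';                  % [0; 1; 2], offsets up to the longest subset
idx  = bsxfun(@plus, offs, starts);      % one column of indices per subset
% idx(3,3) is 9, which is past numel(x): this is the out-of-bounds issue.

xExt = [x, zeros(1, max(idx(:)) - numel(x))];   % option 1: extend the input
blocks = xExt(idx);                      % regular 3-by-3 block of values
blocks(bsxfun(@gt, offs, intv)) = NaN;   % blank out entries beyond each subset's end
out = max(blocks, [], 1);                % column-wise max, NaN entries are skipped
% out is [9 7 8]
```

Passing the dimension explicitly to max keeps the reduction column-wise even if every subset happens to have length one and blocks collapses to a single row.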

Approach #1: Vectorized technique

[m,n] = size(myArray); %// store no. of rows and columns in input array

intv = ends-starts; %// intervals
max_intv = max(intv); %// max interval
max_intv_arr = [0:max_intv]'; %//'# array of max indices extent

[row1,col1] = ind2sub([m n],starts); %// get starts row and column indices

m_ext = max(row1+max_intv); %// no. of rows in extended input array

myArrayExt(m_ext,n)=0; %// extended form of input array
myArrayExt(1:m,:) = myArray;

%// New linear indices for extended form of input array
idx = bsxfun(@plus,max_intv_arr,(col1-1)*m_ext+row1); 

%// Index into extended array; select only valid ones by setting rest to nans
selected_ele = myArrayExt(idx);                  
selected_ele(bsxfun(@gt,max_intv_arr,intv))= nan;

%// Get the max of the valid ones for the desired output
out = nanmax(selected_ele);   %// desired output

Approach #2: Hybrid programming

%// PART - I: Vectorized technique for subsets that when normalized
%// with max extents still lie within limits of input array
intv = ends-starts; %// intervals
max_intv = max(intv); %// max interval

%// Find the last subset that when extended by max interval would still
%// lie within the limits of input array
starts_extent = find(starts+max_intv<=numel(myArray),1,'last');
max_intv_arr = [0:max_intv]'; %//'# Array of max indices extent

%// Index into extended array; select only valid ones by setting rest to nans
selected_ele = myArray(bsxfun(@plus,max_intv_arr,starts(1:starts_extent)));
selected_ele(bsxfun(@gt,max_intv_arr,intv(1:starts_extent))) = nan;

out(numel(starts)) = 0; %// storage for output
out(1:starts_extent) = nanmax(selected_ele); %// output values for part-I

%// PART - II: Process rest of input array elements
for n = starts_extent+1:numel(starts)
    out(n) = max(myArray(starts(n):ends(n)));
end
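
Both listings above combine a column of offsets with starts and intv through bsxfun, so they rely on starts and ends being row vectors, as in the question's example. If they arrive as column vectors, bsxfun sees two columns of different lengths and raises the "Non-singleton dimensions" error. A small sketch of normalizing the orientation first (toy data; the variable failed is only there to show the error path):

```matlab
starts = [2 5 8].';                      % column vectors (hypothetical input)
ends   = [3 6 9].';
offs   = (0:max(ends - starts)).';       % [0; 1]

% A 2-by-1 column and a 3-by-1 column cannot be expanded against each other.
try
    bsxfun(@plus, offs, starts);
    failed = false;
catch
    failed = true;
end

% Force row orientation regardless of how the vectors were supplied.
starts = starts(:).';
ends   = ends(:).';
idx = bsxfun(@plus, offs, starts);       % 2-by-3, one column per subset
```

Using (:).' rather than a bare transpose makes the code indifferent to the orientation of its input.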

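Benchmarking

In this section we compare the two approaches and the original loop code against each other for performance. Let's set up the code before starting the actual benchmarking: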

N = 10000; %// No. of subsets
M1 = 1510; %// No. of rows in input array
M2 = 2185; %// No. of cols in input array
myArray = rand(M1,M2);  %// Input array
num_runs = 50; %// no. of runs for each method

%// Form the starts and ends by getting a sorted random integers array from
%// 1 to one minus no. of elements in input array. That minus one is
%// compensated later on into ends because we don't want any subset with
%// starts and ends as the same index
y1 = reshape(sort(randi(numel(myArray)-1,1,2*N)),2,[]);
starts = y1(1,:);
ends = y1(2,:)+1;

%// Remove identical starts elements
invalid = [false any(diff(starts,[],2)==0,1)];
starts = starts(~invalid);
ends = ends(~invalid);

%// Create myLogicals
myLogicals = false(size(myArray));
for k1=1:numel(starts)
    myLogicals(starts(k1):ends(k1))=1;
end

clear invalid y1 k1 M1 M2 N %// clear unnecessary variables

%// Warm up tic/toc.
for k = 1:100
    tic(); elapsed = toc();
end

Now, the placeholder code that gets us the runtimes:

disp('---------------------- With Original loop code')
tic
for iter = 1:num_runs
    %// ...... approach #1 codes
end
toc
%// clear out variables used in the above approach
%// repeat this for approach #1,2

Benchmark results

In your comments you mentioned using a 1510 x 2185 matrix, so let's do two case runs with that size, using 10000 and 2000 subsets.

Case 1 [Input - 1510 x 2185 matrix, Subsets - 10000]

---------------------- With Original loop code
Elapsed time is 15.625212 seconds.
---------------------- With Approach #1
Elapsed time is 12.102567 seconds.
---------------------- With Approach #2
Elapsed time is 0.983978 seconds.

Case 2 [Input - 1510 x 2185 matrix, Subsets - 2000]

---------------------- With Original loop code
Elapsed time is 3.045402 seconds.
---------------------- With Approach #1
Elapsed time is 11.349107 seconds.
---------------------- With Approach #2
Elapsed time is 0.214744 seconds.

Case 3 [Bigger input - 3000 x 3000 matrix, Subsets - 20000]

---------------------- With Original loop code
Elapsed time is 12.388061 seconds.
---------------------- With Approach #1
Elapsed time is 12.545292 seconds.
---------------------- With Approach #2
Elapsed time is 0.782096 seconds.

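Note that the number of runs, num_runs, was varied to keep the runtime of the fastest approach close to 1 sec.

Conclusions

So, I guess hybrid programming (Approach #2) is the way to go! As future work, if performance suffers because of scatteredness, one could use the standard deviation as a scatteredness criterion and offload the work for the most scattered subsets (in terms of their lengths) to the loop code.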
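
The future-work idea could look roughly like the sketch below. It is not the answer's code: the cutoff mean + std is an arbitrary assumption, the data are made up, and masked positions are redirected to index 1 instead of extending the array, which is a swapped-in simplification.

```matlab
x      = [3 8 1 6 2 9 4 7 5 10 0 11];    % hypothetical data
starts = [1 3 6 11];
ends   = [2 4 10 12];

len     = ends - starts;                 % [1 1 4 1]
thr     = mean(len) + std(len);          % assumed cutoff for "too scattered"
isShort = len <= thr;                    % subsets handled by the vectorized part
out     = nan(1, numel(starts));

% Vectorized part: only the short subsets share one padded index matrix.
sS    = starts(isShort);
lS    = len(isShort);
offs  = (0:max(lS)).';
idx   = bsxfun(@plus, offs, sS);
valid = bsxfun(@le, offs, lS);           % true where the index belongs to its subset
idx(~valid) = 1;                         % any in-bounds index; value is masked below
blocks = x(idx);
blocks(~valid) = NaN;
out(isShort) = max(blocks, [], 1);

% Loop part: the few long subsets that would otherwise inflate the padding.
for k = find(~isShort)
    out(k) = max(x(starts(k):ends(k)));
end
% out is [8 6 10 11]
```

With this split the padded matrix is only as tall as the longest short subset, so a handful of very long subsets no longer dictates the memory footprint.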

Answer 4

Efficiency

Measure both the vectorised and the for-loop code samples on your own platform (be it <localhost> or cloud-based) to see the difference:

MATLAB:7> tic();max( myArray( startIndex(:):endIndex(:) ) );toc() %% Details
Elapsed time is 0.0312 seconds.                                   %% below.
                                                                  %% Code is not
                                                                  %% the merit,
                                                                  %% method is:

and

tic();                                                            %% for/loop
for n = 1:length( startIndex )                                    %% may be
    max( myArray( startIndex(n):endIndex(n) ) );                  %% significantly
end                                                               %% faster than
toc();                                                            %% vectorised
Elapsed time is 0.125 seconds.                                    %% setup(s)
                                                                  %% overhead(s)
%% As commented below,
%% subsequent re-runs yield unrealistic results due to caching artifacts
Elapsed time is 0 seconds.
Elapsed time is 0 seconds.
Elapsed time is 0 seconds.

%% which are not so straight visible if encapsulated in an artificial in-vitro
%% via an outer re-run repetitions ( for k=1:1000 ) et al ( ref. in text below )

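For a better interpretation of the test results, test on much larger sizes rather than on just a few tens of rows/columns.

EDIT: Erroneous code removed, thanks to Dan for the notice. Putting the emphasis on quantitative validation, which may support the assumption that vectorised code may, but need not in all circumstances, be faster, is of course no excuse for faulty code.

Output - quantitatively comparative data:

Although it is often recommended, it is IMHO not fair to assume that memalloc and similar overheads should be excluded from in-vivo testing. Test re-runs typically show VM page-hit improvements and other caching artifacts, while the raw, first "virgin" run is what typically appears in a real code deployment (excluding external iterators, of course). So consider the results with care and retest in your real environment (which sometimes runs as a virtual machine inside a larger system; that also makes it necessary to take VM swap mechanics into account once huge matrices start to hurt on real memory accesses).

On other projects I am used to working with [usec] granularity for real-time test timing, but then even more care has to be taken about the test-execution conditions and the O/S background.

So nothing but testing gives a relevant answer for your specific code and deployment situation; just be methodical and compare data that are comparable in principle.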
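
One way to look at the first-run versus re-run effect described above is to time every repetition separately instead of timing one outer loop. The sketch below is an assumption about how that could be set up (random data, 20 repetitions, the original loop as the code under test); it only uses tic and toc.

```matlab
x      = rand(1, 1e5);                   % hypothetical data
starts = 1:100:(1e5 - 99);               % 1000 subsets of 100 elements each
ends   = starts + 99;

nRuns = 20;
t = zeros(1, nRuns);                     % one timing per repetition
for r = 1:nRuns
    tStart = tic;
    out = zeros(1, numel(starts));
    for n = 1:numel(starts)
        out(n) = max(x(starts(n):ends(n)));
    end
    t(r) = toc(tStart);
end

% Report the cold first run next to the typical warmed-up run.
fprintf('first run: %.6f s, median of re-runs: %.6f s\n', t(1), median(t(2:end)));
```

Keeping the per-run timings makes it visible how much of a reported total comes from warm-up, which a single tic/toc around for k=1:1000 hides.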

Alarik's code:

MATLAB:8> tic(); for k=1:1000                      % ( flattens memalloc issues & al )
>     for n = 1:length( startIndex )
>         max( myArray( startIndex(n):endIndex() ) );
>     end;
> end; toc()
Elapsed time is 0.2344 seconds.
%% time is 0.0002 seconds per k-for-loop <--[ ref.^ remarks on testing ]

Dan's code:

MATLAB:9> tic(); for k=1:1000
>     s_ind( size( myLogicals ) ) = 0;
>     s_ind( startIndex ) = 1;
>     labelled = cumsum( s_ind(:) ).*myLogicals(:);
>     result = accumarray( labelled + 1, myArray(:), [], @max );
> end; toc()
error: product: nonconformant arguments (op1 is 43x1, op2 is 45x1)
%%
%% [Work in progress] to find my mistake -- sorry for not being able to reproduce
%% Dan's code and to make it work
%%
%% Both myArray and myLogicals shape was correct ( 9 x 5 )

How to vectorize code that runs a function on subsets of a larger matrix?