Question

I have two arrays. One is a list of the lengths of consecutive segments of the other. For example:

zarray = [1 2 3 4 5 6 7 8 9 10]

and

lengths = [1 3 2 1 3]

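I want to take the mean of parts of the first array, where the length of each part is given by the second array. For this example, that gives: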

[mean([1]),mean([2,3,4]),mean([5,6]),mean([7]),mean([8,9,10])]

For speed, I am trying to avoid loops. I tried using mat2cell and cellfun as follows:

zcell = mat2cell(zarray,[1],lengths);
zcellsum = cellfun('mean',zcell);

However, the cellfun part is very slow. Is there a way to do this without a loop or cellfun?

Answer 1

Here is a fully vectorized solution (no explicit for-loop, and no hidden loops via ARRAYFUN, CELLFUN, ...). The idea is to use the extremely fast ACCUMARRAY function:

%# data
zarray = [1 2 3 4 5 6 7 8 9 10];
lengths = [1 3 2 1 3];

%# generate subscripts: 1 2 2 2 3 3 4 5 5 5
endLocs = cumsum(lengths(:));
subs = zeros(endLocs(end),1);
subs([1;endLocs(1:end-1)+1]) = 1;
subs = cumsum(subs);

%# mean of each part
means = accumarray(subs, zarray) ./ lengths(:)
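
To make the subscript construction easier to follow, here is a minimal sketch that repeats the same steps on the example data and spells out each intermediate value. The names startLocs, marks and sums are introduced only for illustration. The sketch assumes every entry of lengths is at least 1; a zero-length part would produce duplicate start positions and break the grouping.

```matlab
zarray  = [1 2 3 4 5 6 7 8 9 10];
lengths = [1 3 2 1 3];

% Last index of each part: [1;4;6;7;10]
endLocs = cumsum(lengths(:));

% First index of each part: [1;2;5;7;8]
% (startLocs is a hypothetical helper name, not in the original answer)
startLocs = [1; endLocs(1:end-1)+1];

% Put a 1 at the start of every part: 1 1 0 0 1 0 1 1 0 0
marks = zeros(endLocs(end),1);
marks(startLocs) = 1;

% Running sum turns the marks into group labels: 1 2 2 2 3 3 4 5 5 5
subs = cumsum(marks);

% Sum the values sharing a label, then divide by the part length
sums  = accumarray(subs, zarray(:));   % [1;9;11;7;27]
means = sums ./ lengths(:);            % [1;3;5.5;7;9]
```

Each element of zarray is labelled with the index of the part it belongs to, so ACCUMARRAY can total every part in a single pass.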

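The result in this case is the column vector [1; 3; 5.5; 7; 9].

Speed test:

Consider the following comparison of the different methods. I am using the TIMEIT function by Steve Eddins: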

function [t,v] = testMeans()
    %# generate test data
    [arr,len] = genData();

    %# define functions
    f1 = @() func1(arr,len);
    f2 = @() func2(arr,len);
    f3 = @() func3(arr,len);
    f4 = @() func4(arr,len);

    %# timeit
    t(1) = timeit( f1 );
    t(2) = timeit( f2 );
    t(3) = timeit( f3 );
    t(4) = timeit( f4 );

    %# return results to check their validity
    v{1} = f1();
    v{2} = f2();
    v{3} = f3();
    v{4} = f4();
end

function [arr,len] = genData()
    %#arr = [1 2 3 4 5 6 7 8 9 10];
    %#len = [1 3 2 1 3];

    numArr = 10000;     %# number of elements in array
    numParts = 500;     %# number of parts/regions      
    arr = rand(1,numArr);
    len = zeros(1,numParts);
    len(1:end-1) = diff(sort( randperm(numArr,numParts) ));
    len(end) = numArr - sum(len);
end

function m = func1(arr, len)
    %# @Drodbar: for-loop
    idx = 1;
    N = length(len);
    m = zeros(1,N);
    for i=1:N
        m(i) = mean( arr(idx+(0:len(i)-1)) );
        idx = idx + len(i);
    end
end

function m = func2(arr, len)
    %# @user1073959: MAT2CELL+CELLFUN
    m = cellfun(@mean, mat2cell(arr, 1, len));
end

function m = func3(arr, len)
    %# @Drodbar: ARRAYFUN+CELLFUN
    idx = arrayfun(@(a,b) a-(0:b-1), cumsum(len), len, 'UniformOutput',false);
    m = cellfun(@(a) mean(arr(a)), idx);
end

function m = func4(arr, len)
    %# @Amro: ACCUMARRAY
    endLocs = cumsum(len(:));
    subs = zeros(endLocs(end),1);
    subs([1;endLocs(1:end-1)+1]) = 1;
    subs = cumsum(subs);

    m = accumarray(subs, arr) ./ len(:);
    if isrow(len)
        m = m';
    end
end

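Here are the timings. The tests were run on a 32-bit WinXP machine with MATLAB R2012a. My method is more than an order of magnitude faster than all the others (roughly 40x here). The for-loop and MAT2CELL+CELLFUN are practically tied for second place.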

>> [t,v] = testMeans();
>> t
t =
   0.013098   0.013074   0.022407   0.00031807
    |           |          |          \_________ @Amro: ACCUMARRAY (!)
    |           |           \___________________ @Drodbar: ARRAYFUN+CELLFUN
    |            \______________________________ @user1073959: MAT2CELL+CELLFUN
     \__________________________________________ @Drodbar: FOR-loop

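Also, all results are correct and equal. The differences are on the order of the machine epsilon eps (caused by the different ways round-off error accumulates), so they can be treated as noise and ignored: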

%#assert( isequal(v{:}) )
>> maxErr = max(max( diff(vertcat(v{:})) ))
maxErr =
   3.3307e-16
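
The check above takes signed differences between consecutive rows, so a large negative difference would not show up in the maximum. As a hedged alternative, the sketch below compares two of the methods using absolute differences and an explicit tolerance. The test data, the variable names mLoop and mAcc, and the tolerance of 1e-12 are assumptions chosen for illustration, not part of the original benchmark.

```matlab
% Hypothetical small test data (not the benchmark data above)
arr = rand(1,1000);
len = [100 250 50 600];          % must sum to numel(arr)

% Reference: plain for-loop over the parts
idx = 1;
mLoop = zeros(1,numel(len));
for i = 1:numel(len)
    mLoop(i) = mean(arr(idx:idx+len(i)-1));
    idx = idx + len(i);
end

% ACCUMARRAY version, transposed to a row to match mLoop
endLocs = cumsum(len(:));
subs = zeros(endLocs(end),1);
subs([1; endLocs(1:end-1)+1]) = 1;
subs = cumsum(subs);
mAcc = (accumarray(subs, arr(:)) ./ len(:)).';

% Compare with an absolute tolerance instead of exact equality
maxAbsErr = max(abs(mLoop - mAcc));
tol = 1e-12;                     % assumed tolerance for values in [0,1]
assert(maxAbsErr <= tol)
```

Using abs makes the check independent of the sign and of the order in which the methods are stacked.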

Answer 2

Here is a solution using arrayfun and cellfun:

zarray  = [1 2 3 4 5 6 7 8 9 10];
lengths = [1 3 2 1 3];

% Generate the indexes for the elements contained within each length specified
% subset. idx would be {[1], [4, 3, 2], [6, 5], [7], [10, 9, 8]} in this case
idx = arrayfun(@(a,b) a-(0:b-1), cumsum(lengths), lengths,'UniformOutput',false);
means = cellfun( @(a) mean(zarray(a)), idx);

This gives the output you want:

means =

    1.0000    3.0000    5.5000    7.0000    9.0000

Following @tmpearce's comment, I ran a quick performance comparison between the solution above, which I wrapped in a function called subsetMeans1:

function means = subsetMeans1( zarray, lengths)

% Generate the indexes for the elements contained within each length specified
% subset. idx would be {[1], [4, 3, 2], [6, 5], [7], [10, 9, 8]} in this case
idx = arrayfun(@(a,b) a-(0:b-1), cumsum(lengths), lengths,'UniformOutput',false);
means = cellfun( @(a) mean(zarray(a)), idx);

and a simple for-loop alternative, the function subsetMeans2:

function means = subsetMeans2( zarray, lengths)

% Method based on single loop
idx = 1;
N = length(lengths);
means = zeros( 1, N);
for i = 1:N
    means(i) = mean( zarray(idx+(0:lengths(i)-1)) );
    idx = idx+lengths(i);
end

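I used the following TIMEIT-based test script, which lets you check performance while varying the number of elements in the input vector and the maximum size of each subset: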

% Generate some data for the performance test

% Total of elements on the vector to test
nVec = 100000;

% Max of elements per subset
nSubset = 5;

% Data generation aux variables
lenghtsGen = randi( nSubset, 1, nVec);
accumLen = cumsum(lenghtsGen);
maxIdx = find( accumLen < nVec, 1, 'last' );

% % Original test data
% zarray  = [1 2 3 4 5 6 7 8 9 10];
% lengths = [1 3 2 1 3];

% Vector to test
zarray = 1:nVec;
lengths = [ lenghtsGen(1:maxIdx) nVec-accumLen(maxIdx)] ;

% Double check that the lengths add up to nVec, so nVec is the last index used
assert ( sum(lengths) == nVec)

t1(1) = timeit(@() subsetMeans1( zarray, lengths));
t1(2) = timeit(@() subsetMeans2( zarray, lengths));

fprintf('Time spent subsetMeans1: %f\n',t1(1));
fprintf('Time spent subsetMeans2: %f\n',t1(2));

It turns out that the non-vectorized version, without arrayfun and cellfun, is faster, probably because of the extra overhead of those functions:

Time spent subsetMeans1: 2.082457
Time spent subsetMeans2: 1.278473
