I have two arrays. One is a list of the lengths of consecutive segments within the other. For example
zarray = [1 2 3 4 5 6 7 8 9 10]
and
lengths = [1 3 2 1 3]
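I want to average parts of the first array, where the length of each part is given by the second array. For this example, the result would be: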
[mean([1]),mean([2,3,4]),mean([5,6]),mean([7]),mean([8,9,10])]
For speed, I am trying to avoid loops. I tried using mat2cell and cellfun as follows:
zcell = mat2cell(zarray,[1],lengths);
zcellsum = cellfun('mean',zcell);
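However, the cellfun part is very slow. Is there a way to do this without a loop or cellfun?
Answer 0 (score: 2)
Here is a fully vectorized solution (no explicit for-loop, and no hidden loop via ARRAYFUN, CELLFUN, ...). The idea is to use the extremely fast ACCUMARRAY function: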
%# data
zarray = [1 2 3 4 5 6 7 8 9 10];
lengths = [1 3 2 1 3];
%# generate subscripts: 1 2 2 2 3 3 4 5 5 5
endLocs = cumsum(lengths(:));
subs = zeros(endLocs(end),1);
subs([1;endLocs(1:end-1)+1]) = 1;
subs = cumsum(subs);
%# mean of each part
means = accumarray(subs, zarray) ./ lengths(:)
The result in this case:
means =
1
3
5.5
7
9
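As a side note, the subscript vector built above with two cumsum calls could also be generated with repelem. This is only a sketch under two assumptions: repelem requires MATLAB R2015a or later (so it is not available on the R2012a setup used for the timings in this answer), and every entry of lengths is a positive integer.

```matlab
%# data from the question
zarray  = [1 2 3 4 5 6 7 8 9 10];
lengths = [1 3 2 1 3];

%# group index for every element: 1 2 2 2 3 3 4 5 5 5
%# (assumes R2015a or later, where repelem is available)
subs = repelem(1:numel(lengths), lengths);

%# sum each group with ACCUMARRAY, then divide by the group sizes
means = accumarray(subs(:), zarray(:)).' ./ lengths;
```

The accumarray call and the division are the same idea as in the code above; only the way subs is produced changes. Whether this is faster than the cumsum-based construction has not been measured here.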
Consider the following comparison of the different approaches. I am using the TIMEIT function by Steve Eddins:
function [t,v] = testMeans()
%# generate test data
[arr,len] = genData();
%# define functions
f1 = @() func1(arr,len);
f2 = @() func2(arr,len);
f3 = @() func3(arr,len);
f4 = @() func4(arr,len);
%# timeit
t(1) = timeit( f1 );
t(2) = timeit( f2 );
t(3) = timeit( f3 );
t(4) = timeit( f4 );
%# return results to check their validity
v{1} = f1();
v{2} = f2();
v{3} = f3();
v{4} = f4();
end
function [arr,len] = genData()
%#arr = [1 2 3 4 5 6 7 8 9 10];
%#len = [1 3 2 1 3];
numArr = 10000; %# number of elements in array
numParts = 500; %# number of parts/regions
arr = rand(1,numArr);
len = zeros(1,numParts);
len(1:end-1) = diff(sort( randperm(numArr,numParts) ));
len(end) = numArr - sum(len);
end
function m = func1(arr, len)
%# @Drodbar: for-loop
idx = 1;
N = length(len);
m = zeros(1,N);
for i=1:N
m(i) = mean( arr(idx+(0:len(i)-1)) );
idx = idx + len(i);
end
end
function m = func2(arr, len)
%# @user1073959: MAT2CELL+CELLFUN
m = cellfun(@mean, mat2cell(arr, 1, len));
end
function m = func3(arr, len)
%# @Drodbar: ARRAYFUN+CELLFUN
idx = arrayfun(@(a,b) a-(0:b-1), cumsum(len), len, 'UniformOutput',false);
m = cellfun(@(a) mean(arr(a)), idx);
end
function m = func4(arr, len)
%# @Amro: ACCUMARRAY
endLocs = cumsum(len(:));
subs = zeros(endLocs(end),1);
subs([1;endLocs(1:end-1)+1]) = 1;
subs = cumsum(subs);
m = accumarray(subs, arr) ./ len(:);
if isrow(len)
m = m';
end
end
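Here are the timings. The test was run on a WinXP 32-bit machine with MATLAB R2012a. My method is more than an order of magnitude faster than all the others (roughly 40 times faster than the next best). The for-loop and MAT2CELL+CELLFUN versions are effectively tied for second.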
>> [t,v] = testMeans();
>> t
t =
    0.013098     0.013074     0.022407   0.00031807
       |            |            |            \_________ @Amro: ACCUMARRAY (!)
       |            |            \______________________ @Drodbar: ARRAYFUN+CELLFUN
       |            \___________________________________ @user1073959: MAT2CELL+CELLFUN
       \________________________________________________ @Drodbar: FOR-loop
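Also, all results are correct and equal. The differences are on the order of the machine precision eps (caused by the different order in which round-off errors accumulate), so they are treated as noise and ignored: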
%#assert( isequal(v{:}) )
>> maxErr = max(max( diff(vertcat(v{:})) ))
maxErr =
3.3307e-16
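Since an exact isequal check fails on differences of this size, a tolerance-based comparison is one way to turn the check into an assertion. The sketch below uses two made-up vectors (a and b are hypothetical stand-ins for two entries of v), and the factor of 10 on eps is an arbitrary choice, not something taken from the benchmark.

```matlab
%# two results that agree only up to round-off (hypothetical example values)
a = [0.1+0.2, 1/3];
b = [0.3,     1 - 2/3];

%# exact comparison is too strict for floating-point results
exactlyEqual = isequal(a, b);

%# compare against a small multiple of eps, scaled to the magnitude of the data
tol = 10 * eps(max(abs([a b])));
closeEnough = all(abs(a - b) <= tol);
```

In the benchmark, a and b would be two of the returned results, for example v{1} and v{4}, which are both row vectors.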
Answer 1 (score: 0)
Here is a solution using arrayfun and cellfun:
zarray = [1 2 3 4 5 6 7 8 9 10];
lengths = [1 3 2 1 3];
% Generate the indexes for the elements contained within each length specified
% subset. idx would be {[1], [4, 3, 2], [6, 5], [7], [10, 9, 8]} in this case
idx = arrayfun(@(a,b) a-(0:b-1), cumsum(lengths), lengths,'UniformOutput',false);
means = cellfun( @(a) mean(zarray(a)), idx);
This gives the output you want:
means =
1.0000 3.0000 5.5000 7.0000 9.0000
Following @tmpearce's comment, I did a quick performance comparison. I wrapped the solution above in a function called subsetMeans1:
function means = subsetMeans1( zarray, lengths)
% Generate the indexes for the elements contained within each length specified
% subset. idx would be {[1], [4, 3, 2], [6, 5], [7], [10, 9, 8]} in this case
idx = arrayfun(@(a,b) a-(0:b-1), cumsum(lengths), lengths,'UniformOutput',false);
means = cellfun( @(a) mean(zarray(a)), idx);
and wrote a simple for-loop alternative as the function subsetMeans2:
function means = subsetMeans2( zarray, lengths)
% Method based on single loop
idx = 1;
N = length(lengths);
means = zeros( 1, N);
for i = 1:N
means(i) = mean( zarray(idx+(0:lengths(i)-1)) );
idx = idx+lengths(i);
end
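I used the following TIMEIT-based test script, which lets you check performance while varying the number of elements in the input vector and the maximum number of elements per subset: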
% Generate some data for the performance test
% Total number of elements in the vector to test
nVec = 100000;
% Maximum number of elements per subset
nSubset = 5;
% Auxiliary variables for data generation
lengthsGen = randi( nSubset, 1, nVec);
accumLen = cumsum(lengthsGen);
maxIdx = find( accumLen < nVec, 1, 'last' );
% % Original test data
% zarray = [1 2 3 4 5 6 7 8 9 10];
% lengths = [1 3 2 1 3];
% Vector to test
zarray = 1:nVec;
lengths = [ lengthsGen(1:maxIdx) nVec-accumLen(maxIdx)] ;
% Double check that the lengths add up to nVec, so nVec is the last index used
assert ( sum(lengths) == nVec)
t1(1) = timeit(@() subsetMeans1( zarray, lengths));
t1(2) = timeit(@() subsetMeans2( zarray, lengths));
fprintf('Time spent subsetMeans1: %f\n',t1(1));
fprintf('Time spent subsetMeans2: %f\n',t1(2));
It turns out that the non-vectorized version, without arrayfun and cellfun, is faster, probably because of the extra overhead of those functions:
Time spent subsetMeans1: 2.082457
Time spent subsetMeans2: 1.278473
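For completeness, the ACCUMARRAY idea from the other answer could be dropped into the same test script as a third function. The sketch below is an adaptation, not something that was timed here: subsetMeans3 is a hypothetical name, and it assumes every entry of lengths is a positive integer and that sum(lengths) equals numel(zarray).

```matlab
function means = subsetMeans3( zarray, lengths)
% Hypothetical third variant based on ACCUMARRAY (adapted from the other answer).
% Assumes positive integer lengths with sum(lengths) == numel(zarray).
endLocs = cumsum(lengths(:));
subs = zeros(endLocs(end),1);
subs([1;endLocs(1:end-1)+1]) = 1;   % mark the first element of each subset
subs = cumsum(subs);                % 1 2 2 2 3 3 4 5 5 5 for the sample data
% Sum per subset, divide by the subset size, return a row like subsetMeans1/2
means = (accumarray(subs, zarray(:)) ./ lengths(:)).';
end
```

It could then be timed alongside the other two with t1(3) = timeit(@() subsetMeans3( zarray, lengths)); in the script above.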