Question

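I am trying to optimize the value of N used to split an array into chunks for a vectorized computation, so that it runs as fast as possible on different machines. My test code is below.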

#example use random values
clear all,
t=rand(1,556790);
inner_freq=rand(8193,6);

N=100; # use N chunks
nn = int32(linspace(1, length(t)+1, N+1))
aa_sig_combined=zeros(size(t));
total_time_so_far=0;
for ii=1:N
    tic;
    ind = nn(ii):nn(ii+1)-1;
    aa_sig_combined(ind) = sum(diag(inner_freq(1:end-1,2)) * cos(2 .* pi .* inner_freq(1:end-1,1) * t(ind)) .+ repmat(inner_freq(1:end-1,3),[1 length(ind)]));
    toc
    total_time_so_far=total_time_so_far+sum(toc)
end
fprintf('- Complete  test in %4.4fsec or %4.4fmins\n',total_time_so_far,total_time_so_far/60);

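On a 16 GB i7 machine running Ubuntu, with N = 100 this takes 162.7963 seconds, or 2.7133 minutes, to complete.

Is there a way to find the value of N that makes this run fastest on different machines?

PS: I am running Octave 3.8.1 on a 16 GB i7 with Ubuntu 14.04, but this will also run on a Raspberry Pi 2 with 1 GB of RAM.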

Answer 1

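Here is the MATLAB test script I used to time each intermediate value. The return breaks out of the loop after the first iteration, since the remaining iterations look similar.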

%example use random values
clear all;
t=rand(1,556790);
inner_freq=rand(8193,6);

N=100; % use N chunks
nn = int32( linspace(1, length(t)+1, N+1) );
aa_sig_combined=zeros(size(t));

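A = inner_freq(1:end-1,1);       % frequency column used in the cos argument
D = diag(inner_freq(1:end-1,2)); % diagonal matrix, built once outside the loop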
for ii=1:N
    ind = nn(ii):nn(ii+1)-1;
    tic;
    cosPara = 2 * pi * A * t(ind);
    toc;
    cosResult = cos( cosPara );
    sumParaA = D * cosResult;
    toc;
    sumParaB = repmat(inner_freq(1:end-1,3),[1 length(ind)]);
    toc;
    aa_sig_combined(ind) = sum( sumParaA + sumParaB );
    toc;
    return;
end

The output is shown below. Note that my computer is slow.

Elapsed time is 0.156621 seconds.
Elapsed time is 17.384735 seconds.
Elapsed time is 17.922553 seconds.
Elapsed time is 18.452994 seconds.

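As you can see, almost all of the time (about 17 seconds) falls between the first and second toc, the interval that covers the cos call and the multiplication by D. You are running cos on an 8192x5568 matrix (45,613,056 elements), which is expensive.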
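Because the second toc in the script covers both the cos call and the product with D, the two costs can be separated by timing them individually. The following is a minimal sketch under assumed, much smaller sizes (K and T are illustrative stand-ins, not the original 8192 and 5568); the relative cost of the two steps may differ on your machine and at full size.

```octave
% Reduced sizes (assumed for illustration) so this finishes quickly
K = 512;                       % rows, standing in for the original 8192
T = 1000;                      % columns, standing in for one chunk of t
inner_freq = rand(K + 1, 6);
t = rand(1, T);

A = inner_freq(1:end-1, 1);
D = diag(inner_freq(1:end-1, 2));
cosPara = 2 * pi * A * t;      % K x T argument matrix

id = tic;
cosResult = cos(cosPara);
t_cos = toc(id);               % time spent in cos alone

id = tic;
sumParaA = D * cosResult;
t_mul = toc(id);               % time spent in the (K x K) * (K x T) product

fprintf('cos: %.4f s, D * cosResult: %.4f s\n', t_cos, t_mul);
```

Using a timer handle (id = tic; toc(id)) keeps each measurement independent of any other tic call in the session.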

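If you want better performance, use parfor, since every iteration is independent. Assuming you had 100 cores to run the 100 iterations, your script would finish in about 17 seconds plus the parfor overhead.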
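A rough sketch of the parfor idea follows, using small stand-in sizes that are my own assumption. Each chunk writes into its own cell, so no iteration depends on another. Keep in mind that in MATLAB parfor only runs in parallel when the Parallel Computing Toolbox is available, and that Octave, as far as I know, accepts the parfor keyword but executes the loop serially, so this may bring no speedup on the asker's setup.

```octave
% Small stand-in sizes (assumed) so the sketch runs quickly
t = rand(1, 2000);
inner_freq = rand(65, 6);

N = 10;                                        % number of chunks
nn = round(linspace(1, length(t) + 1, N + 1)); % chunk boundaries

A = inner_freq(1:end-1, 1);                    % frequencies
D = diag(inner_freq(1:end-1, 2));              % amplitudes on the diagonal
c = inner_freq(1:end-1, 3);                    % per-row offsets

parts = cell(1, N);                            % one result slot per chunk
parfor ii = 1:N
    ind = nn(ii):nn(ii+1)-1;
    % same arithmetic as the loop body in the test script
    parts{ii} = sum(D * cos(2 * pi * A * t(ind)) + repmat(c, [1 length(ind)]));
end
aa_sig_combined = [parts{:}];                  % stitch the chunks back together
```

Collecting results in a cell array instead of assigning to aa_sig_combined(ind) avoids the restrictions MATLAB places on indexed output variables inside parfor.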

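For the cos computation, you may want to check whether there is another way to compute the cosine values that parallelizes better than the stock function.

Another small optimization is the line below. It keeps the diag call out of the loop, since the diagonal matrix never changes. You don't want to generate an 8192x8192 diagonal matrix on every iteration! I simply store it outside the loop, which also gives some performance improvement.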

D = diag(inner_freq(1:end-1,2));
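Going one step further than hoisting diag, the diagonal matrix may not be needed at all. Multiplying by diag(a) only scales each row, and summing the scaled rows is the same as the row vector a' times the cosine matrix; likewise, summing the repmat term over rows just adds sum(c) to every column. This is a swapped-in reformulation, not the answer's original method, shown as a sketch with small assumed sizes and illustrative names (f, a, c, C):

```octave
% Small stand-in sizes (assumed) to check the equivalence
t = rand(1, 500);
inner_freq = rand(33, 6);

f = inner_freq(1:end-1, 1);    % frequencies
a = inner_freq(1:end-1, 2);    % amplitudes
c = inner_freq(1:end-1, 3);    % offsets
C = cos(2 * pi * f * t);       % K x T cosine matrix

% Original formulation: dense K x K diagonal matrix times C
ref = sum(diag(a) * C + repmat(c, [1 length(t)]));

% Row scaling without building the diagonal matrix
scaled = sum(bsxfun(@times, a, C) + repmat(c, [1 length(t)]));

% Collapsed form: one (1 x K) * (K x T) product plus a scalar
fast = a' * C + sum(c);

fprintf('max difference: %g\n', max(abs(ref - fast)));
```

The collapsed form replaces a K x K by K x T product with a 1 x K by K x T product and skips the K x T repmat temporary, which also lowers memory use per chunk.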

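Note that I did not use the MATLAB profiler because it was not working for me, but you should use it in the future, especially once the code is organized into functions.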
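As for choosing N on a particular machine, one straightforward approach is to measure it: run the chunked loop for a handful of candidate values on a reduced problem and keep the fastest. The sketch below assumes small stand-in sizes and an arbitrary candidate list (both my own choices); timings are noisy, so on real hardware it may be worth repeating each measurement and also checking that the largest chunk fits in memory, particularly on the Raspberry Pi.

```octave
% Reduced problem (assumed sizes) so the sweep finishes quickly
t = rand(1, 20000);
inner_freq = rand(129, 6);

A = inner_freq(1:end-1, 1);
D = diag(inner_freq(1:end-1, 2));
c = inner_freq(1:end-1, 3);

candidates = [1 2 5 10 20 50];        % hypothetical values of N to try
elapsed = zeros(size(candidates));

for k = 1:numel(candidates)
    N = candidates(k);
    nn = round(linspace(1, length(t) + 1, N + 1));
    out = zeros(size(t));
    id = tic;
    for ii = 1:N
        ind = nn(ii):nn(ii+1)-1;
        out(ind) = sum(D * cos(2 * pi * A * t(ind)) + repmat(c, [1 length(ind)]));
    end
    elapsed(k) = toc(id);             % total time for this N
end

[~, best] = min(elapsed);
best_N = candidates(best);
fprintf('fastest N among candidates: %d (%.4f s)\n', best_N, elapsed(best));
```

The chunk size that wins on the reduced problem does not necessarily win at full size, so treat the result as a starting point and rerun the sweep with the real array dimensions where time allows.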
