Question

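I often need to summarize irregularly timed time series with a given aggregation function (e.g., sum, mean, etc.). However, my current solution is inefficient and slow.

Take the aggregation function: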

function aggArray = aggregate(array, groupIndex, collapseFn)

groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));

for iGr = 1:size(groups,1)
    grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
    for iSer = 1:size(array, 2)
      aggArray(iGr,iSer) = collapseFn(array(grIdx,iSer));
    end
end

end

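Note that both array and groupIndex can be 2D. Each column of array is an independent series to be aggregated, but the columns of groupIndex should be taken together (as a row) to specify a period.

Then, when we throw an irregular time series at it (note that the second period is one base period longer), the timing results are poor: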

a = rand(20006,10);
b = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);

tic; aggregate(a, b, @sum); toc
Elapsed time is 1.370001 seconds.

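Using the profiler, we can see that the grIdx line takes about 1/4 of the execution time (0.28 s) and the iSer loop takes about 3/4 (1.17 s) of the total (1.48 s).

Compare this to the period-indifferent case: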

tic; cumsum(a); toc
Elapsed time is 0.000930 seconds.

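Is there a more efficient way to aggregate this data?

Timing results

Putting each response in a separate function, here are the timing results I get with timeit using Matlab 2015b on Windows 7 with an Intel i7: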

    original | 1.32451
      felix1 | 0.35446
      felix2 | 0.16432
    divakar1 | 0.41905
    divakar2 | 0.30509
    divakar3 | 0.16738
matthewGunn1 | 0.02678
matthewGunn2 | 0.01977

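Clarification on `groupIndex`

An example of a 2D groupIndex would be one where both the year number and the week number are specified for a set of daily data covering 1980-2015: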

a2 = rand(36*52*5, 10);
b2 = [sort(repmat(1980:2015, [1 52*5]))' repmat(1:52, [1 36*5])'];

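Thus a "year-week" period is uniquely identified by a row of groupIndex. This is effectively handled by calling unique(groupIndex, 'rows') and taking the third output, so feel free to disregard this portion of the question.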
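
As a small illustration of that reduction, the sketch below uses a made-up five-row year-week index (the values are purely illustrative) and shows how the third output of unique turns each row into a single integer label:

```matlab
% Hypothetical 2D group index: column 1 is the year, column 2 the week.
groupIndex = [1980 1;
              1980 1;
              1980 2;
              1981 1;
              1981 1];

% groups holds the distinct rows in sorted order; idx maps every row of
% groupIndex to the row of groups it matches.
[groups, ~, idx] = unique(groupIndex, 'rows');

disp(groups)   % distinct (year, week) pairs
disp(idx.')    % one integer label per original row
```

After this step idx can be used wherever a 1D group label vector is expected.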

Answer 1

Approach #1

You can create the masks corresponding to grIdx for all groups in one go with bsxfun(@eq,..). Then, for collapseFn as @sum, you can bring in matrix multiplication and thus have a completely vectorized approach, like so -

M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
aggArray = M.'*array;

For collapseFn as @mean, you need to do a bit more work, as shown here -

M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
aggArray = bsxfun(@rdivide,M,sum(M,1)).'*array;
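
To make the mask-and-multiply idea concrete, here is a small worked sketch on made-up data (four rows, two groups). The explicit double() conversion is my own addition for clarity and is not part of the answer's code.

```matlab
array      = [1 10; 2 20; 3 30; 4 40];
groupIndex = [1; 1; 2; 2];
groups     = unique(groupIndex, 'rows');

% M(i,g) is true when row i of groupIndex equals row g of groups.
M = squeeze(all(bsxfun(@eq, groupIndex, permute(groups, [3 2 1])), 2));
M = double(M);

% Sum: each output row adds up the rows of array selected by one mask column.
sums = M.' * array;

% Mean: scale each mask column by its group size before multiplying.
means = bsxfun(@rdivide, M, sum(M, 1)).' * array;

disp(sums)
disp(means)
```

Note that M has one column per group and one row per observation, so its memory footprint grows with the product of the two.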

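Approach #2

If you are working with a generic collapseFn, you can use the 2D mask M created with the previous approach to index into the rows of array, thus changing the complexity from O(n^2) to O(n). Some quick tests suggest that this gives an appreciable speedup over the original loopy code. Here's the implementation -

n = size(groups,1);
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
out = zeros(n,size(array,2));
for iGr = 1:n
    out(iGr,:) = collapseFn(array(M(:,iGr),:),1);
end

Please note that the 1 in collapseFn(array(M(:,iGr),:),1) denotes the dimension along which collapseFn is applied, so that 1 is essential there.

Bonus

By its name, groupIndex seems to hold integer values, which could be exploited for a more efficient creation of M: treat each row of groupIndex as an indexing tuple, convert each row into a scalar, and thus get a 1D array version of groupIndex. This must be more efficient, as the data size would be O(n) now. This M can be fed to all the approaches listed in this post. So, we would have M like so -

dims = max(groupIndex,[],1);
agg_dims = cumprod([1 dims(end:-1:2)]);
[~,~,idx] = unique(groupIndex*agg_dims(end:-1:1).'); %//'

m = size(groupIndex,1);
M = false(m,max(idx));
M((idx-1)*m + [1:m]') = 1;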
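
A minimal sketch of the mask-indexing loop with a reducer other than sum or mean, assuming made-up data and @median as the collapse function (any function that accepts a dimension as its second argument would do):

```matlab
array      = [1 10; 2 20; 3 30; 4 40; 5 50];
groupIndex = [1; 1; 1; 2; 2];
collapseFn = @median;            % called as collapseFn(x, 1) below

groups = unique(groupIndex, 'rows');
n      = size(groups, 1);

% One logical mask column per group.
M = squeeze(all(bsxfun(@eq, groupIndex, permute(groups, [3 2 1])), 2));

out = zeros(n, size(array, 2));
for iGr = 1:n
    % Collapse along dimension 1 so every column is reduced separately.
    out(iGr, :) = collapseFn(array(M(:, iGr), :), 1);
end

disp(out)
```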
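
The row-to-scalar conversion can be traced by hand on a tiny example. The sketch below assumes that groupIndex contains only positive integers (the values are made up) and follows the same steps, with the intermediate key vector named keys for readability:

```matlab
groupIndex = [1 1; 1 1; 1 2; 2 1; 2 1];   % assumed: positive integers only

% Mixed-radix weights so that every distinct row maps to a distinct scalar.
dims     = max(groupIndex, [], 1);
agg_dims = cumprod([1 dims(end:-1:2)]);
keys     = groupIndex * agg_dims(end:-1:1).';

% Relabel the scalar keys as consecutive group numbers 1..nGroups.
[~, ~, idx] = unique(keys);

% Build the mask by linear indexing: row r goes into column idx(r).
m = size(groupIndex, 1);
M = false(m, max(idx));
M((idx - 1) * m + (1:m)') = true;

disp(keys.')
disp(M)
```

Because only one comparison-free pass over the rows is needed, no rows-by-columns-by-groups intermediate array is created.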


Answer 2

Mex Function 1

HAMMER TIME: a Mex function to crush it. The base case test with the original code from the question took 1.334139 seconds on my machine. IMHO, the 2nd fastest answer, from @Divakar, is:

groups2 = unique(groupIndex);
aggArray2 = squeeze(all(bsxfun(@eq,groupIndex,permute(groups2,[3 2 1])),2)).'*array;

Elapsed time is 0.589330 seconds.

Then my MEX function:

[groups3, aggArray3] = mg_aggregate(array, groupIndex, @(x) sum(x, 1));

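Elapsed time is 0.079725 seconds.

Testing that we get the same answer: norm(groups2-groups3) returns 0 and norm(aggArray2 - aggArray3) returns 2.3959e-15. The results also match the original code.

Code to generate the test conditions: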

array = rand(20006,10);
groupIndex = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);

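For pure speed, go mex. If the thought of compiling C++ code, or the added complexity, is too much of a pain, go with Divakar's answer. Another disclaimer: I haven't subjected my function to robust testing.

Mex Approach 2

Somewhat surprisingly to me, this code appears even faster than the full Mex version in some cases (e.g., in this test it took about 0.05 seconds). It uses a mex function, mg_getRowsWithKey, to figure out the indices of the groups. I think it may be because the array copying in my full mex function isn't as fast as it could be, and/or because of the overhead of calling 'feval'. It has basically the same algorithmic complexity as the other version.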

[unique_groups, map] = mg_getRowsWithKey(groupIndex);

results = zeros(length(unique_groups), size(array,2));

for iGr = 1:length(unique_groups)
   array_subset             = array(map{iGr},:);

   %// do your collapse function on array_subset. eg.
   results(iGr,:)           = sum(array_subset, 1);
end

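When you do array(groups(1)==groupIndex,:) to pull out the array entries associated with a group, you are searching through the ENTIRE length of groupIndex. If you have millions of row entries, this will totally suck. array(map{1},:) is far more efficient.

There is still unnecessary copying of memory and other overhead associated with calling 'feval' on the collapse function. If you implement the aggregator function efficiently in C++ in such a way as to avoid copying memory, probably another 2x speedup can be achieved.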
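
The source of mg_getRowsWithKey is not shown here, so the following is only a pure-MATLAB sketch of the same idea under my own assumptions: build, once, a cell array whose g-th entry lists the row numbers of group g, then index with it instead of scanning groupIndex for every group. It uses unique and accumarray rather than a mex file and makes no claim about matching the mex function's speed.

```matlab
groupIndex = [1; 1; 2; 2; 2; 3];
array      = (1:6)' * [1 10];            % rows [1 10], [2 20], ..., [6 60]

[unique_groups, ~, idx] = unique(groupIndex);

% One cell per group holding that group's row numbers. accumarray does not
% promise the order of values within a group, hence the sort.
map = accumarray(idx, (1:numel(idx))', [], @(r) {sort(r)});

results = zeros(numel(unique_groups), size(array, 2));
for iGr = 1:numel(unique_groups)
    % Direct row lookup; no comparison against the whole groupIndex.
    results(iGr, :) = sum(array(map{iGr}, :), 1);
end

disp(results)
```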

Answer 3

A little late to the party, but a single loop using accumarray makes a huge difference:

function aggArray = aggregate_gnovice(array, groupIndex, collapseFn)

  [groups, ~, index] = unique(groupIndex, 'rows');
  numCols = size(array, 2);
  aggArray = nan(size(groups, 1), numCols);  % one row per unique group (numel would overcount for a 2D groupIndex)
  for col = 1:numCols
    aggArray(:, col) = accumarray(index, array(:, col), [], collapseFn);
  end

end

Timing this using timeit in MATLAB R2016b for the sample data in the question gives the following:

original | 1.127141
 gnovice | 0.002205

Over a 500x speedup!
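
For anyone unfamiliar with accumarray, here is a minimal sketch on made-up values showing what a single call does for one data column: it groups the values by their integer label and reduces each group, with summation as the default.

```matlab
index = [1; 1; 2; 3; 3];     % group label of each value
vals  = [1; 2; 3; 4; 5];

s  = accumarray(index, vals);              % default reduction is sum
mx = accumarray(index, vals, [], @max);    % any vector-to-scalar function

disp([s mx])
```

Since accumarray works on one value vector at a time, the loop over columns in the function above is what extends it to a 2D array.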

Answer 4

Doing away with the inner loop, i.e.

function aggArray = aggregate(array, groupIndex, collapseFn)

groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));

for iGr = 1:size(groups,1)
    grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
    aggArray(iGr,:) = collapseFn(array(grIdx,:));
end

and calling the collapse function with a dimension parameter

res=aggregate(a, b, @(x)sum(x,1));

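already gives some speedup (3x on my machine) and avoids errors, e.g. the ones that sum or mean produce when they encounter a single row of data without a dimension parameter and then collapse across columns rather than across labels.

If you have only one group label vector, i.e. the same group labels for all columns of data, you can speed things up further: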
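
The single-row pitfall is easy to demonstrate. In this small sketch (values made up), a group that happens to contain just one observation is reduced with and without the dimension argument:

```matlab
oneRow = [1 2 3];            % a group with a single row and three series

noDim   = sum(oneRow);       % sums along the first non-singleton dimension
withDim = sum(oneRow, 1);    % sums down the rows, one result per column

disp(noDim)
disp(withDim)
```

Without the dimension argument the three series are added together into one scalar, which is why the handle @(x)sum(x,1) is passed instead of @sum.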

function aggArray = aggregate(array, groupIndex, collapseFn)

ng=max(groupIndex);
aggArray = nan(ng, size(array, 2));

for iGr = 1:ng
    aggArray(iGr,:) = collapseFn(array(groupIndex==iGr,:));
end

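The latter function gives identical results for your example, with a 6x speedup, but it cannot handle different group labels per data column.

Assuming a 2D test case for the group index (here with 10 different columns for groupIndex):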

a = rand(20006,10);
B=[]; % make random length periods for each of the 10 signals
for i=1:size(a,2)
      n0=randi(10);
      b=transpose([ones(1,n0) 2*ones(1,11-n0) sort(repmat((3:4001), [1 5]))]);
      B=[B b];
end
tic; erg0=aggregate(a, B, @sum); toc % original method 
tic; erg1=aggregate2(a, B, @(x)sum(x,1)); toc %just remove the inner loop
tic; erg2=aggregate3(a, B, @(x)sum(x,1)); toc %use function below

Elapsed time is 2.646297 seconds.
Elapsed time is 1.214365 seconds.
Elapsed time is 0.039678 seconds (!!!!).

function aggArray = aggregate3(array, groupIndex, collapseFn)

[groups,ix1,jx] = unique(groupIndex, 'rows','first');
[groups,ix2,jx] = unique(groupIndex, 'rows','last');

ng=size(groups,1);
aggArray = nan(ng, size(array, 2));

for iGr = 1:ng
    aggArray(iGr,:) = collapseFn(array(ix1(iGr):ix2(iGr),:));
end

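I think this is as fast as it gets without using MEX. Thanks to Matthew Gunn for the suggestion! Profiling shows that 'unique' is really cheap here, and getting just the first and last index of the repeated rows in groupIndex speeds things up considerably. Note that this relies on identical rows of groupIndex being contiguous, as they are in the test data. I get an 88x speedup with this iteration of the aggregation.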
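
A small sketch of the first/last lookup on made-up, already sorted labels may help. It shows the two index vectors that the loop turns into contiguous row ranges:

```matlab
groupIndex = [1; 1; 2; 2; 2; 3];   % assumed sorted: equal rows are adjacent

% Row number of the first and of the last occurrence of each distinct row.
[~, ix1] = unique(groupIndex, 'rows', 'first');
[~, ix2] = unique(groupIndex, 'rows', 'last');

% Group g then occupies rows ix1(g):ix2(g).
disp([ix1(:) ix2(:)])
```

If the labels were not sorted, the range ix1(g):ix2(g) would also pick up rows belonging to other groups, so unsorted input would need a sort (applied to array as well) before this approach is used.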

Answer 5

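I have a solution that is almost as fast as the mex one but uses only Matlab. The logic is the same as in most of the answers above, creating a dummy 2D matrix, but instead of using @eq I initialize a logical array from the start.

Elapsed time for mine is 0.172975 seconds. Elapsed time for Divakar's is 0.289122 seconds.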

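function aggArray = aggregate(array, group, collapseFn)
    % group is assumed to be a vector of positive integer labels
    [m,~] = size(array);
    n = max(group);
    D = false(m,n);                 % D(i,g) is true when row i belongs to group g
    row = (1:m)';
    idx = m*(group(:) - 1) + row;   % linear index of (row, group) in D
    D(idx) = true;
    aggArray = zeros(n,size(array,2));   % one output row per group
    for ii = 1:n
        aggArray(ii,:) = collapseFn(array(D(:,ii),:),1);
    end
end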
