Question

I have two vectors:

data vector: A = [1 2 2 1 2 6; 2 3 2 3 3 5]
label vector: B = [1 2 1 2 3 NaN]

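I want to take the mean of all columns that share the same label and output the results as a matrix whose columns are sorted by label number, ignoring NaN. So, in this example I would want: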

labelmean(A,B) = [1.5 1.5 2; 2 3 3]

This can be done with a for loop like this:

function out = labelmean(data,label)
out=[];
for i=unique(label)
    if isnan(i); continue; end
    out = [out, mean(data(:,label==i),2)];
end

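However, I am working with huge arrays containing many data points and labels, and this snippet will be executed often. I am wondering whether there is a more efficient way to do this that does not loop over every label.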

Answer 1

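This is a good case for accumarray. Think of accumarray as a miniature MapReduce paradigm: there are keys and values, and accumarray's job is to group together all values that share the same key and then do something with those values. In your case, the keys are the elements of B, and the values are the column positions that go with those elements. Basically, the position of each value in B tells you which column of A to access. Therefore, we simply need to gather all column positions that map to the same label, access those columns of A, and take the mean across them. We need to be careful to ignore values that are NaN; we can filter those out before calling accumarray. The "something" that you do in accumarray is traditionally supposed to output a single number, but here we actually output a column vector for each label. A trick, therefore, is to wrap the output in a cell array and then use cat combined with a comma-separated list to convert the output into a matrix.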

So, something like this should work:

% Sample data
A = [1 2 2 1 2 6; 2 3 2 3 3 5];
B = [1 2 1 2 3 NaN];

% Find non-NaN locations
mask = ~isnan(B);

% Column indices of A whose label is not NaN, plus the labels themselves
ind = 1 : numel(B);
Bf = B(mask).';
ind = ind(mask).';

% Find label-wise means
C = accumarray(Bf, ind, [], @(x) {mean(A(:,x), 2)});

% Convert to numeric matrix
out = cat(2, C{:});
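
For readers new to accumarray, the grouping it performs can be seen in isolation on toy data. This is only a minimal sketch; the variables keys and vals are made up for illustration and are not part of the question.

```matlab
% Toy keys (labels) and values; both are column vectors of equal length
keys = [1; 2; 1; 2; 3];
vals = [10; 20; 30; 40; 50];

% Default behaviour: sum all values that share a key
sums = accumarray(keys, vals);               % [40; 60; 50]

% Custom reduction: mean of the values that share a key
means = accumarray(keys, vals, [], @mean);   % [20; 30; 50]

% Returning a cell lets each group produce something other than a scalar.
% sort() is used because accumarray does not promise an order within a group.
groups = accumarray(keys, vals, [], @(x) {sort(x)});
```

The last call is the same trick used in the solution: each cell holds one group's result, and the cells are concatenated afterwards.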

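If you don't want to use temporary variables to find the non-NaN values, we can do this in fewer lines of code, but you still need a vector of column indices to determine where to sample from: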

% Sample data
A = [1 2 2 1 2 6; 2 3 2 3 3 5];
B = [1 2 1 2 3 NaN];

% Solution
ind = 1 : numel(B);
C = accumarray(B(~isnan(B)).', ind(~isnan(B)).', [], @(x) {mean(A(:,x), 2)});
out = cat(2, C{:});

With your data, we get:

>> out

out =

    1.5000    1.5000    2.0000
    2.0000    3.0000    3.0000
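
accumarray requires its keys to be positive integers, which the labels in the question happen to be. If the labels were arbitrary numbers, one possible workaround (an assumption on my part, not something the question asks for) is to remap them to 1..K with the third output of unique. The label values 10, 40 and 70 below are hypothetical.

```matlab
A = [1 2 2 1 2 6; 2 3 2 3 3 5];
B = [10 40 10 40 70 NaN];            % hypothetical non-consecutive labels

mask = ~isnan(B);
% labels: sorted distinct labels; idx: position of each kept label in 'labels'
[labels, ~, idx] = unique(B(mask));
cols = find(mask).';                 % column indices of A that have a valid label

% Group column indices by remapped label and average those columns of A
C = accumarray(idx(:), cols, [], @(x) {mean(A(:, x), 2)});
out = cat(2, C{:});                  % one column per entry of 'labels'
```

Because idx always covers 1..K without gaps, every cell of C is non-empty and the columns of out line up with labels.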

Answer 2

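Here is one approach:

1. Get the indices of the labels that are not NaN.
2. Create a sparse matrix of zeros and ones that, when A is multiplied by it, gives the desired row sums.
3. Divide that matrix by the sum of each of its columns, so that the sums become means.
4. Apply matrix multiplication to get the result, and convert it to a full matrix.

Code: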

I = find(~isnan(B));                                 % step 1
t = sparse(I, B(I), 1, size(A,2), max(B(I)));        % step 2
t = bsxfun(@rdivide, t, sum(t,1));                   % step 3
result = full(A*t);                                  % step 4
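
To make the four steps concrete, here is a small sketch that runs them on the sample data from the question and keeps the intermediate matrices. The names counts, indicator and w are introduced only for illustration.

```matlab
A = [1 2 2 1 2 6; 2 3 2 3 3 5];
B = [1 2 1 2 3 NaN];

I = find(~isnan(B));                            % columns with a valid label
t = sparse(I, B(I), 1, size(A,2), max(B(I)));   % t(c,k) = 1 if column c has label k

indicator = full(t);       % 6x3; row 6 is all zeros because B(6) is NaN
counts = full(sum(t, 1));  % number of columns per label: [2 2 1]

w = bsxfun(@rdivide, t, sum(t, 1));             % each column of w sums to 1
result = full(A * w);                           % [1.5 1.5 2; 2 3 3]
```

Each column of w holds the averaging weights for one label, so A*w computes every label mean in a single matrix product. Note that a label value with no columns would give a zero column sum and therefore a NaN column in the result.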

Answer 3

This answer is not a new method but a benchmark of the given answers, because if you talk about performance, you always have to benchmark it.

clear all;
% I tried to make a real-life dataset (the original author may provide a
% better one)
A = [1:3e4; 1:10:3e5; 1:100:3e6]; % large dataset
B = repmat(1:1e3, 1, 3e1); % large number of labels

labelmean(A,B);
labelmeanLuisMendoA(A,B);
labelmeanLuisMendoB(A,B);
labelmeanRayryeng(A,B);

function out = labelmean(data,label)
    tic
    out=[];
    for i=unique(label)
        if isnan(i); continue; end
        out = [out, mean(data(:,label==i),2)];
    end
    toc
end

function out = labelmeanLuisMendoA(A,B)
    tic
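    B2 = B(~isnan(B)); % remove NaN's
    % Note: row indices 1:numel(B2) match A's columns only if no NaN precedes
    % a valid label in B (true for this dataset, which has no NaN at all)
    t = full(sparse(1:numel(B2),B2,1,size(A,2),max(B2))); % template matrix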
    out = A*t; % sum of columns that share a label
    out = bsxfun(@rdivide, out, sum(t,1)); % convert sum into mean
    toc
end

function out = labelmeanLuisMendoB(A,B)
    tic
    B2 = B(~isnan(B));                                   % step 1
    % Row indices 1:numel(B2) assume no NaN precedes a valid label in B
    t = sparse(1:numel(B2), B2, 1, size(A,2), max(B2));  % step 2
    t = bsxfun(@rdivide, t, sum(t,1));                   % step 3
    out = full(A*t);                                     % step 4
    toc
end

function out = labelmeanRayryeng(A,B)
    tic
    ind = 1 : numel(B);
    C = accumarray(B(~isnan(B)).', ind(~isnan(B)).', [], @(x) {mean(A(:,x), 2)});
    out = cat(2, C{:});
    toc
end

The output is:

Elapsed time is 0.080415 seconds. % original
Elapsed time is 0.088427 seconds. % LuisMendo original answer
Elapsed time is 0.004223 seconds. % LuisMendo optimised version
Elapsed time is 0.037347 seconds. % rayryeng answer

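For this dataset, LuisMendo's optimised version is the clear winner, while his first version is slightly slower than the original.

=> Don't forget to benchmark your own code!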
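
A single tic/toc measurement can be noisy. As a possible refinement (not what was used for the numbers above), MATLAB's timeit runs a function handle several times and reports a typical time. The sketch below assumes the same synthetic dataset; loopMean, normalizeCols and sparseMean are hypothetical helper names, written as anonymous functions so the snippet runs as a plain script.

```matlab
% Same synthetic dataset as in the benchmark
A = [1:3e4; 1:10:3e5; 1:100:3e6];
B = repmat(1:1e3, 1, 3e1);

% Loop-style reference: one mean per unique non-NaN label
loopMean = @(A, B) cell2mat(arrayfun(@(k) mean(A(:, B == k), 2), ...
    unique(B(~isnan(B))), 'UniformOutput', false));

% Sparse-matrix approach from Answer 2, written as a single expression
normalizeCols = @(t) bsxfun(@rdivide, t, sum(t, 1));
sparseMean = @(A, B) full(A * normalizeCols( ...
    sparse(find(~isnan(B)), B(~isnan(B)), 1, size(A, 2), max(B))));

% Check that both give the same result before comparing speed
ref  = loopMean(A, B);
fast = sparseMean(A, B);
assert(max(abs(ref(:) - fast(:)) ./ abs(ref(:))) < 1e-12);

% timeit calls each handle repeatedly and returns a time in seconds
tLoop   = timeit(@() loopMean(A, B));
tSparse = timeit(@() sparseMean(A, B));
fprintf('loop: %.6f s, sparse: %.6f s\n', tLoop, tSparse);
```

The timings will differ from machine to machine, so no expected numbers are given here.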

Edit: test platform specs

Matlab R2016b
Ubuntu 64-bit
15.6 GiB RAM
Intel® Core™ i7-5600U CPU @ 2.60GHz × 4
