Question

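I have a matrix X with tens of rows and thousands of columns. All elements are categorical and have been recoded into an index matrix. For example, the ith column X(:,i) = [-1,-1,0,2,1,2]' is converted to X2(:,i) = ic, where [x,ia,ic] = unique(X(:,i)), so that the function accumarray can be used conveniently. I randomly select a submatrix from the matrix and count how many times each unique value occurs in each column of the submatrix. I repeat this procedure 10,000 times. I know several ways to count the unique values in a column; the fastest one I have found so far is the following: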

mx = max(X);
for iter = 1:numperm
    for j = 1:ny
        ky = yrand(:,iter)==uy(j);
        % select submatrix from X where all rows correspond to rows in y that y equals to uy(j)
        Xk = X(ky,:);
        % specify the sites where to put the number of each unique value
        mxj = mx*(j-1);
        mxi = mxj+1;
        mxk = max(Xk)+mxj;
        % iteration to count number of unique values in each column of the submatrix
        for i = 1:c
            pxs(mxi(i):mxk(i),i) = accumarray(Xk(:,i),1);
        end
    end
end
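
To make the index conversion described at the top concrete, here is a minimal sketch that uses the example column from the question. The variable names xcol, vals and counts are illustrative only and do not appear in the original code.

```matlab
% Example column taken from the question text
xcol = [-1 -1 0 2 1 2]';

% unique returns the sorted distinct values and, in ic, the index of
% each element of xcol within that sorted list
[vals, ~, ic] = unique(xcol);   % vals = [-1;0;1;2], ic = [1;1;2;4;3;4]

% accumarray adds 1 at each index, i.e. counts occurrences per value
counts = accumarray(ic, 1);     % counts = [2;1;1;2]
```

Because ic holds positive integers, it can be passed directly to accumarray as subscripts, which is why the whole matrix is recoded once up front.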

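The nested loop above is part of a random permutation test that computes the information gain between a data matrix X of size n by c and a categorical variable y, where y is randomly permuted. In the code, all randomly permuted copies of y are stored in the matrix yrand, and the number of permutations is numperm. The unique values of y are stored in uy, and their number is ny. In each iteration over 1:numperm, the submatrix Xk is selected for each unique element of y, and the number of occurrences of each unique element in every column of that submatrix is counted and stored in the matrix pxs.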
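
The post does not show how the information gain is derived from the counts. As a rough sketch only, assuming the common definition IG = H(x) - sum_j p(y = j) * H(x | y = j) with entropies in bits, a single index-coded column could be handled as below. The names xcol, H and ig, as well as the toy labels, are hypothetical.

```matlab
% Hypothetical toy data: one index-coded column and two classes
xcol = [1 1 2 4 3 4]';
y    = [1 1 1 -1 -1 -1]';
uy   = unique(y);

% Entropy (in bits) computed from a vector of counts; zero counts are skipped
H = @(n) -sum((n(n > 0) / sum(n)) .* log2(n(n > 0) / sum(n)));

% Start from the entropy of the whole column, then subtract the
% class-weighted entropies of the per-class counts
ig = H(accumarray(xcol, 1));
for j = 1:numel(uy)
    ky = (y == uy(j));                 % rows belonging to class j
    nj = accumarray(xcol(ky), 1);      % counts of each index within class j
    ig = ig - mean(ky) * H(nj);        % mean(ky) is the class proportion
end
```

In a permutation test, the same quantity would be recomputed for every shuffled copy of y, which is why the per-class counting dominates the run time.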

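The most time-consuming part of the nested loop is the inner iteration over i = 1:c when c is large.

Is it possible to apply the function accumarray in a matrix manner so that the for loop can be avoided? How else could I improve my code?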

-------

As requested, here is a simplified test function that contains the code above, followed by test data:

%% test
function test(x,y)

[r,c] = size(x);
x2 = x;
numperm = 1000;

% convert the original matrix to index matrix for suitable and fast use of accumarray function
for i = 1:c
    [~,~,ic] = unique(x(:,i));
    x2(:,i) = ic;
end

% get 'numperm' rand permutations of y
yrand(r, numperm) = 0;
for i = 1:numperm
    yrand(:,i) = y(randperm(r));
end

% get statistic of y
uy = unique(y);
nuy = numel(uy);

% main iterations
mx = max(x2);
pxs(max(mx),c) = 0;
for iter = 1:numperm
    for j = 1:nuy
        ky = yrand(:,iter)==uy(j);
        xk = x2(ky,:);
        mxj = mx*(j-1);
        mxk = max(xk)+mxj;
        mxi = mxj+1;
        for i = 1:c
            pxs(mxi(i):mxk(i),i) = accumarray(xk(:,i),1);
        end
    end
end

Test the function:

x = round(randn(60,3000));
y = [ones(30,1);ones(30,1)*-1];

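On my computer, tic; test(x,y); toc returns Elapsed time is 15.391628 seconds. The test function uses 1000 permutations. If I run 10,000 permutations and add some extra computation (negligible compared with the code above), the expected time is therefore more than 150 s, roughly ten times as long. I wonder whether the code can be improved. Intuitively, running accumarray in a matrix manner could save a lot of time. Is that possible?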

Answer 1

The approach suggested by @rahnema1 sped up the computation considerably, so I am posting my answer here, as @Dev-iL also requested.

%% test
function test(x,y)

[r,c] = size(x);
x2 = x;
numperm = 1000;

% convert the original matrix to index matrix for suitable and fast use of accumarray function
for i = 1:c
    [~,~,ic] = unique(x(:,i));
    x2(:,i) = ic;
end

% get 'numperm' rand permutations of y
yrand(r, numperm) = 0;
for i = 1:numperm
    yrand(:,i) = y(randperm(r));
end

% get statistic of y
uy = unique(y);
nuy = numel(uy);

% main iterations
mx = max(max(x2));
% preallocation
pxs(mx*nuy,c) = 0;
% set the edges of the bin for function histc
binrg = (1:mx)';
% preallocation of the range of matrix into which the results will be stored
mxr = mx*(0:nuy);
for iter = 1:numperm
    yt = yrand(:,iter);
    for j = 1:nuy
        % rows of class j, all columns; histc counts each column separately
        pxs(mxr(j)+1:mxr(j+1),:) = histc(x2(yt==uy(j),:),binrg);
    end
end

Test results:

>> x = round(randn(60,3000));
>> y = [ones(30,1);ones(30,1)*-1];
>> tic; test(x,y); toc
Elapsed time is 15.632962 seconds.
>> tic; test(x,y); toc % using the way suggested by rahnema1, i.e., revised function posted above
Elapsed time is 2.900463 seconds.
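
As an aside on the original question about running accumarray in a matrix manner: accumarray also accepts two-column subscripts, so the counts for all columns can be produced in one call. The sketch below is not from the original answer. It uses hypothetical toy data, the names cols, counts and ref are illustrative, and whether it is faster than histc on data of this size would need to be timed.

```matlab
% Hypothetical toy data; xk and mx mirror the names used in the function above
xk = [1 2; 1 1; 3 2; 2 2];     % 4-by-2 index-coded submatrix
mx = 3;                        % largest index in the full matrix
[nk, c] = size(xk);

% One accumarray call: row subscript = category index,
% column subscript = column number of each element
cols   = repmat(1:c, nk, 1);
counts = accumarray([xk(:), cols(:)], 1, [mx, c]);   % counts = [2 1; 1 3; 1 0]

% Reference result from the per-column loop, padded to mx rows
ref = zeros(mx, c);
for i = 1:c
    ref(:, i) = accumarray(xk(:, i), 1, [mx, 1]);
end
same = isequal(counts, ref);   % true for this example
```

Passing the output size [mx, c] explicitly keeps every column the same height, so categories that are absent from the submatrix show up as zeros.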

Calculate the number of unique values in each column of a submatrix in a fast way
