Question

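I have a matrix X (100000 x 10) and a vector Y (100000 x 1). The rows of X are categorical, taking values from 1 to 5, and the labels are categorical too (11 to 20).

The rows of X repeat, and only about 25% of them are unique. For each unique row, I want the label to be the statistical mode of all the labels in Y that belong to that row.

Then there is another dataset P (90000 x 10), and I want to predict its labels Q based on the previous step.

What I tried is finding the unique rows of X with unique in MATLAB and then assigning each unique row the statistical mode of its labels. For P, I can use ismember and do the same.

The problem is the size of the dataset: the process takes 1.5-2 hours to finish. Is a vectorized version possible in MATLAB?

Here is my code: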

[X_unique,~,ic] = unique(X,'rows','stable');
labels=zeros(length(X_unique),1);
for i=1:length(X_unique)
    labels(i)=mode(Y(ic==i));
end

Q=zeros(length(P),1);
for j=1:length(X_unique)
    Q(all(repmat(X_unique(j,:),length(P),1)==P,2))=labels(j);
end

Answer 1

You will be able to speed up your first loop a great deal if you replace it entirely with:

labels = accumarray(ic, Y, [], @(y) mode(y));
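To make the accumarray call concrete, here is a minimal sketch on toy data. The values of ic and Y below are made up for illustration; in the question they come from unique(X, 'rows', 'stable') and from the label vector.

```matlab
% Hypothetical toy data: 6 observations that fall into 3 groups
ic = [1; 2; 1; 3; 2; 1];          % group index of each observation
Y  = [11; 15; 11; 20; 15; 12];    % label of each observation

% accumarray collects the Y values that share an index and applies mode
% to each group, replacing the explicit loop over unique rows
labels = accumarray(ic, Y, [], @mode);

disp(labels)   % one label per group
```

Group 1 holds the labels 11, 11 and 12, so its mode is 11. When several values are equally frequent, mode returns the smallest of them.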

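The second loop can be sped up by using all(bsxfun(@eq, X_unique(i,:), P), 2) inside Q(...). This is a good vectorized approach, assuming your arrays are not extremely large relative to the available memory on your machine. In addition, to save more time, you can apply to P the same unique trick you used on X, so that all the comparisons run on a much smaller array:

[P_unique, ~, IC_P] = unique(P, 'rows', 'stable');

Edit: compute Q_unique as follows:

Q_unique = zeros(length(P_unique),1);
for i = 1:length(X_unique)
    Q_unique(all(bsxfun(@eq, X_unique(i,:), P_unique), 2)) = labels(i);
end

and convert back to Q_full to match the original P input:

Q_full = Q_unique(IC_P);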
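To see why indexing with the third output of unique restores a full-length result, here is a small sketch. The matrix P and the labels in Q_unique are hypothetical toy values, not taken from the question's data.

```matlab
% Hypothetical small P with repeated rows
P = [1 2; 3 4; 1 2; 5 5; 3 4];

% P_unique holds each distinct row once, in order of first appearance;
% IC_P says which unique row every original row corresponds to
[P_unique, ~, IC_P] = unique(P, 'rows', 'stable');

% Indexing the unique rows with IC_P rebuilds the original matrix
P_rebuilt = P_unique(IC_P, :);

% The same indexing expands per-unique-row labels to one label per row of P
Q_unique = [11; 12; 13];          % made-up labels, one per unique row
Q_full   = Q_unique(IC_P);
```

Because P_unique(IC_P,:) reproduces P row for row, any quantity computed once per unique row can be expanded the same way.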

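End of edit

Finally, if memory is an issue, in addition to all of the above, you may want to use a semi-vectorized approach inside your second loop:

for i = 1:length(X_unique)
    % start with every row of P as a candidate match
    idx = true(length(P), 1);
    % narrow the candidates one column at a time
    for j = 1:size(X_unique,2)
        idx = idx & (X_unique(i,j) == P(:,j));
    end
    Q(idx) = labels(i);
%    Q(all(bsxfun(@eq, X_unique(i,:), P), 2)) = labels(i);
end

This takes about 3 times longer than the bsxfun version, but if memory is limited you have to pay with speed.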

Another edit

Depending on your MATLAB version, you can also use containers.Map to your advantage by mapping a textual representation of each numeric sequence to the computed labels. See the example below.

% find unique members of X to work with a smaller array
[X_unique, ~, IC_X] = unique(X, 'rows', 'stable');

% compute labels
labels = accumarray(IC_X, Y, [], @(y) mode(y));

% convert X to cellstr -- textual representation of the number sequence
X_cellstr = cellstr(char(X_unique+48)); % 48 is ASCII for 0

% map each X to its label
X_map = containers.Map(X_cellstr, labels);

% find unique members of P to work with a smaller array
[P_unique, ~, IC_P] = unique(P, 'rows', 'stable');

% convert P to cellstr -- textual representation of the number sequence
P_cellstr = cellstr(char(P_unique+48)); % 48 is ASCII for 0

% --- EDIT --- avoiding error on missing keys in X_map --------------------
% find which P's exist in map
isInMapP = X_map.isKey(P_cellstr);

% pre-allocate Q_unique to the size of P_unique (can be any value you want)
Q_unique = nan(size(P_cellstr)); % NaN is safe to use since not a label

% find the labels for each P_unique that exists in X_map
Q_unique(isInMapP) = cell2mat(X_map.values(P_cellstr(isInMapP)));
% --- END EDIT ------------------------------------------------------------

% convert back to full Q array to match original P
Q_full = Q_unique(IC_P);

This takes about 15 seconds to run on my laptop, most of which is consumed by the computation of mode.
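A further option, which is not part of the answer above but follows from the ismember idea mentioned in the question, is to match the unique rows of P against the unique rows of X in a single call. The sketch below uses hypothetical toy data and assumes that rows of P with no match in X should receive NaN.

```matlab
% Hypothetical toy data standing in for X, Y and P
X = [1 2; 3 4; 1 2; 5 5];
Y = [11; 12; 11; 13];
P = [3 4; 1 2; 2 2; 3 4];

% Label each unique row of X with the mode of its labels
[X_unique, ~, IC_X] = unique(X, 'rows', 'stable');
labels = accumarray(IC_X, Y, [], @mode);

% Match the unique rows of P against the unique rows of X in one call
[P_unique, ~, IC_P] = unique(P, 'rows', 'stable');
[found, loc] = ismember(P_unique, X_unique, 'rows');

% Rows of P that never occur in X keep NaN (an assumption of this sketch)
Q_unique = nan(size(P_unique, 1), 1);
Q_unique(found) = labels(loc(found));

% Expand back to one label per row of P
Q_full = Q_unique(IC_P);
```

Here the row [2 2] does not occur in X, so its entry in Q_full stays NaN. How this compares in speed with the containers.Map version on the full dataset would need to be measured.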

Efficiently assigning labels based on given examples for a large dataset
