I have a matrix M. Suppose that each row of M is a subject and each column is a measurement.
M = rand(100);               % generate a random 100x100 matrix
c = randperm(length(M),100); % random permutation of the 100 measurement (column) indices
r = randperm(length(M),100); % random permutation of the 100 subject (row) indices
for i = 1 : 100
    M(r(i),c(i)) = NaN;      % add NaNs at random, i.e. subject r(i) does not have measurement c(i)
end
Now I delete the measurements (if any) that are missing for all subjects:
idx_col_all_NAN = find(all(isnan(M)==1));
M(:,idx_col_all_NAN)=[];
I also delete the subjects (if any) that are missing all measurements:
idx_row_all_NAN = find(all(isnan(M)==1,2));
M(idx_row_all_NAN,:)=[];
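Now I want to delete measurements so as to maximize the number of subjects that share the same measurements, while minimizing the number of NaN cells in M.
Can you help me?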
Answer 0 (score: 0)
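To keep removing NaNs from the matrix, you need to set some rule that balances the trade-off between the amount of data you keep and the number of NaNs. As you said, if you keep deleting NaNs without any restriction, you may be left with very little data. There is no single correct rule; it really depends on your requirements, and the suggestion below is only meant to give you an idea of how to approach this kind of problem.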
So, as a starting point, I define an index of the "quality" of the matrix, in terms of how many "holes" it has:
M_ratio = sum(~isnan(M(:)))/numel(M); % fraction of the cells in M that are numbers (not NaN)
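This index gets larger as the matrix contains more data, and it equals 1 when there are no NaNs. We could keep deleting rows/columns from the matrix as long as we see an improvement, but because the matrix keeps getting smaller, we will always see an improvement as long as any NaNs are left, so we would end up with an empty matrix (or a very small one, depending on how many NaNs we have).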
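As a quick illustration of this index, here is a minimal sketch on a made-up 3x4 matrix (the matrix A and the variable names are only for illustration, not data from the question):

```matlab
% Hypothetical 3x4 example: 3 of the 12 cells are NaN
A = [1   NaN 3   4;
     5   6   NaN 8;
     9   NaN 11  12];
A_ratio = sum(~isnan(A(:)))/numel(A);   % 9/12 = 0.75

% Deleting the column with the most NaNs (column 2) raises the index
B = A;
B(:,2) = [];
B_ratio = sum(~isnan(B(:)))/numel(B);   % 8/9, about 0.889
```

Deleting the worst column removes two NaNs and only one number, which is why the index goes up even though the matrix holds less data.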
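So we need to define some threshold for the improvement, so that if a deletion does not improve the matrix by a certain amount, we stop the process: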
improve = 1-M_old_ratio/M_new_ratio % the relative improvement after deletion
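improve is the relative gain in our "quality" index; if it is not large enough, we stop deleting rows/columns from the matrix. What is large enough? That is hard to say, but I will leave it to you to play with it and see what gives you a decent result.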
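To get a feel for the scale of this value, here is a small worked sketch with made-up ratios (the two ratio values and the name keep_deleting are assumptions for illustration only):

```matlab
M_old_ratio = 0.75;    % hypothetical index before a deletion
M_new_ratio = 8/9;     % hypothetical index after the deletion
improve = 1 - M_old_ratio/M_new_ratio;   % 1 - 0.84375 = 0.15625

threshold = 0.003;     % an example threshold value
keep_deleting = improve > threshold;     % true: this deletion would be accepted
```

A gain of about 15.6% is far above a threshold of 0.003, so the deletion is kept. On a large matrix a single deletion usually changes the index much less, which is why the threshold has to be small.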
So here is the full code:
N = 100;
M = rand(N);                         % generate an NxN random matrix
M(randi(numel(M),N^2,1)) = nan;      % set N^2 randomly drawn cells to NaN (drawn with replacement, so fewer distinct cells)
M(:,all(isnan(M),1)) = [];           % delete all-NaN columns
M(all(isnan(M),2),:) = [];           % delete all-NaN rows
threshold = 0.003;                   % stop once the relative improvement is no larger than this
while 1
    M_ratio = sum(~isnan(M(:)))/numel(M);      % fraction of non-NaN cells in M
    [mincol,indcol] = min(sum(~isnan(M),1));   % column with the fewest numbers (most NaNs)
    [minrow,indrow] = min(sum(~isnan(M),2));   % row with the fewest numbers (most NaNs)
    [~,dir] = min([minrow;mincol]);            % 1 = the row is worse, 2 = the column is worse (ties go to the row)
    Mtry = M;
    if dir == 1
        Mtry(indrow,:) = [];                   % delete row
    else
        Mtry(:,indcol) = [];                   % delete column
    end
    Mtry_ratio = sum(~isnan(Mtry(:)))/numel(Mtry); % get the new ratio
    improve = 1-M_ratio/Mtry_ratio;            % the relative improvement after deletion
    if improve>threshold                       % if it improves by more than the threshold
        M = Mtry;                              % keep the smaller matrix
    else
        break;                                 % otherwise, quit
    end
end
If you only consider deleting columns and not rows, it is a bit simpler:
threshold = 0.002;                   % stop once the relative improvement is no larger than this
while 1
    M_ratio = sum(~isnan(M(:)))/numel(M);      % fraction of non-NaN cells in M
    [~,indcol] = min(sum(~isnan(M),1));        % column with the fewest numbers (most NaNs)
    Mtry = M;
    Mtry(:,indcol) = [];                       % delete column
    Mtry_ratio = sum(~isnan(Mtry(:)))/numel(Mtry); % get the new ratio
    improve = 1-M_ratio/Mtry_ratio;            % the relative improvement after deletion
    if improve>threshold                       % if it improves by more than the threshold
        M = Mtry;                              % keep the smaller matrix
    else
        break;                                 % otherwise, quit
    end
end
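When playing with the threshold, it may help to look at what is left once a loop finishes. The following is only a sketch: it assumes M is the matrix left after one of the loops above, and the names n_subjects, n_measurements, n_nan and n_complete are mine, not part of the code above. The small fallback matrix is hypothetical and is there only so the snippet runs on its own.

```matlab
% Assumes M is the matrix left after one of the loops above.
% Hypothetical fallback so that the snippet also runs stand-alone:
if ~exist('M','var')
    M = [1 2 NaN; 4 5 6; 7 NaN 9];
end
[n_subjects, n_measurements] = size(M);
n_nan      = sum(isnan(M(:)));          % NaN cells still in M
n_complete = sum(~any(isnan(M),2));     % subjects that have every remaining measurement
fprintf('%d x %d matrix, %d NaN cells, %d complete subjects\n', ...
        n_subjects, n_measurements, n_nan, n_complete);
```

Comparing these numbers for a few threshold values shows how much data each setting trades away for fewer NaNs.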
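As you will notice, I introduced the NaNs into the matrix in a more compact way, but that does not really matter since you have real data. I also use logical indexing, which is a more compact and efficient way to delete columns and rows.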
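To show that the two ways of deleting are interchangeable, here is a minimal sketch on a made-up 3x3 matrix (A, A1, A2 and same are illustrative names only):

```matlab
% Hypothetical 3x3 example with one all-NaN column
A = [1 NaN 3; 4 NaN 6; 7 NaN 9];

% find-based deletion, as in the question
A1 = A;
idx = find(all(isnan(A1),1));
A1(:,idx) = [];

% logical-index deletion, as in the answer
A2 = A;
A2(:,all(isnan(A2),1)) = [];

same = isequal(A1, A2);   % true: both leave [1 3; 4 6; 7 9]
```

The logical mask is used directly as the column index, so the intermediate call to find is not needed.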