Question

I have a matrix M. Assume that each row of M is a subject and each column is a measurement.

M=rand(100);                % generate a 100x100 random matrix
c=randperm(length(M),100);  % randomly select 100 measurement indices
r=randperm(length(M),100);  % randomly select 100 subject indices

for i = 1 : 100
    M(r(i),c(i))=NaN;       % insert NaNs at random, i.e. subject r(i) does not have measurement c(i)
end

Now I delete the measurements (if any) that are missing for all subjects:

idx_col_all_NAN = find(all(isnan(M)==1));   
M(:,idx_col_all_NAN)=[];

Then I delete the subjects (if any) that are missing all measurements:

idx_row_all_NAN = find(all(isnan(M)==1,2));   
M(idx_row_all_NAN,:)=[];

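Now I want to delete measurements so as to maximize the number of subjects that share the same measurements, while minimizing the number of cells of M that contain NaN.

Can you help me?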

Answer 1

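To keep removing NaNs from the matrix, you need some rule for the trade-off between how much data you keep and how many NaNs remain. As you said, if you keep deleting NaNs without any limit, you may be left with very little data. There is no single correct rule; it really depends on your requirements, and the suggestions below are only meant to give you an idea of how to approach this kind of problem.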

So, as a starting point, I define an index for the "quality" of the matrix in terms of how many "holes" it has:

M_ratio = sum(~isnan(M(:)))/numel(M); % the ratio between numbers to M size

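This index grows as the matrix holds more data, and equals 1 when there are no NaNs. We could keep deleting rows/columns for as long as we see an improvement, but because the matrix keeps shrinking, we will generally keep seeing an improvement as long as NaNs remain, so we would end up with an empty matrix (or a very small one, depending on how many NaNs we have).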
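As a rough illustration of how the index behaves, here is a minimal sketch on a made-up 3x4 matrix (the values and the names A, ratio_before, ratio_after and worst are only for this example). Dropping the column with the fewest numbers raises the ratio:

```matlab
% Toy 3x4 matrix (made-up values): rows are subjects, columns are measurements
A = [1   NaN 3   4;
     5   NaN 7   8;
     9   10  NaN 12];
ratio_before = sum(~isnan(A(:)))/numel(A);   % 9 numbers in 12 cells = 0.75
[~,worst] = min(sum(~isnan(A),1));           % column 2 holds the fewest numbers
A(:,worst) = [];                             % drop that column
ratio_after = sum(~isnan(A(:)))/numel(A);    % 8 numbers in 9 cells, about 0.889
```

The ratio improved, but the matrix also lost a column that held one real value, which is exactly the trade-off discussed here.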

So we need to define a threshold for the improvement, so that if a deletion does not improve the matrix by a certain amount, we stop the process:

improve = 1-M_old_ratio/M_new_ratio % the relative improvement after deletion

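improve is the relative gain in our "quality" index; if it is not large enough, we stop deleting rows/columns from the matrix. What is large enough? That is hard to say, but I will leave it to you to play with and see what gives you a decent result.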
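To get a feel for the scale of improve, here is a small sketch with made-up ratios and a hypothetical threshold of 0.05 (none of these numbers come from real data; improve_a, improve_b, accept_a and accept_b are names used only here):

```matlab
threshold = 0.05;                 % hypothetical value, tune it for your own data

% Case A: a deletion lifts the ratio from 0.75 to 0.90
improve_a = 1 - 0.75/0.90;        % about 0.167
accept_a  = improve_a > threshold; % true: keep the deletion and continue

% Case B: a deletion lifts the ratio only from 0.90 to 0.92
improve_b = 1 - 0.90/0.92;        % about 0.022
accept_b  = improve_b > threshold; % false: stop deleting
```

A larger threshold stops earlier and keeps more data along with more NaNs; a smaller one keeps deleting for longer.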

So here is the complete code:

N = 100;
M = rand(N); % generate an NxN random matrix
M(randi(numel(M),N^2,1)) = nan;  % set N^2 randomly drawn entries to NaN (drawn with replacement, so some indices repeat)
M(:,all(isnan(M)))=[]; % delete all-NaN columns
M(all(isnan(M),2),:)=[]; % delete all-NaN rows
threshold = 0.003; % the threshold at which to stop optimizing the matrix
while 1
    M_ratio = sum(~isnan(M(:)))/numel(M); % the ratio between numbers to M size
    [mincol,indcol] = min(sum(~isnan(M),1)); % find the column with most NaN
    [minrow,indrow] = min(sum(~isnan(M),2)); % find the row with most NaN
    [~,dir] = min([minrow;mincol]); % choose whichever holds fewer numbers: 1 = row, 2 = column
    Mtry = M;
    if dir == 1
        Mtry(indrow,:) = []; % delete row
    else
        Mtry(:,indcol) = []; % delete column
    end
    Mtry_ratio = sum(~isnan(Mtry(:)))/numel(Mtry); % get the new ratio
    improve = 1-M_ratio/Mtry_ratio; % the relative improvement after deletion
    if improve>threshold % if it improves more than the threshold
        M = Mtry; % replace the matrix
    else
        break; % otherwise - quit
    end
end

If you only consider deleting columns and not rows, it is a little simpler:

threshold = 0.002; % the threshold at which to stop optimizing the matrix
while 1
    M_ratio = sum(~isnan(M(:)))/numel(M); % the ratio between numbers to M size
    [~,indcol] = min(sum(~isnan(M),1)); % find the column with most NaN
    Mtry = M;
    Mtry(:,indcol) = []; % delete column
    Mtry_ratio = sum(~isnan(Mtry(:)))/numel(Mtry); % get the new ratio
    improve = 1-M_ratio/Mtry_ratio; % the relative improvement after deletion
    if improve>threshold % if it improves more than the threshold
        M = Mtry; % replace the matrix
    else
        break; % otherwise - quit
    end
end
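Once a loop like this has finished, it may help to check the result against the goal stated in the question. The following is only a sketch on a made-up matrix (the values and the names complete, n_complete and n_nan are assumptions for illustration); it counts the subjects that have every remaining measurement and the NaN cells that are left:

```matlab
% Toy result after column removal (made-up values)
M = [1   2   3;
     4   NaN 6;
     7   8   9;
     NaN 1   2];
complete   = ~any(isnan(M),2);   % true for subjects with no missing measurement
n_complete = sum(complete);      % 2 subjects (rows 1 and 3) are fully observed
n_nan      = sum(isnan(M(:)));   % 2 NaN cells remain
```

Tracking these two numbers for a few threshold values is one way to decide which threshold gives an acceptable result.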

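As you will notice, I introduce the NaNs into the matrix in a more compact way, but that does not matter since you have real data. I also use logical indexing, which is a more compact and efficient way to delete columns and rows.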
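For comparison, here is a minimal sketch on a made-up matrix showing that deleting by index (with find, as in the question) and deleting with a logical mask (as in the code above) give the same result; A, B and C are names used only for this example:

```matlab
A = [1   NaN 3;
     4   NaN 6;
     NaN NaN 9];                        % made-up data, column 2 is all NaN
B = A;
B(:,find(all(isnan(B)==1))) = [];       % index-based deletion, as in the question
C = A;
C(:,all(isnan(C))) = [];                % logical-mask deletion, no find needed
same = isequaln(B,C);                   % true: isequaln treats NaNs as equal
```

The mask version skips the intermediate index vector, which is why it reads shorter.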
