Working with NaN

Time: 2016-09-14 15:01:09

Tags: matlab optimization matrix missing-data

I have a matrix M. Assume that each row of M is a subject and each column is a measurement.

M = rand(100);                % generate a random 100x100 matrix
c = randperm(length(M),100);  % randomly permuted measurement (column) indices
r = randperm(length(M),100);  % randomly permuted subject (row) indices

for i = 1 : 100
    M(r(i),c(i)) = NaN;       % insert a NaN, i.e. subject r(i) does not have measurement c(i)
end
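The same setup can be written without a loop. This is only a sketch, assuming that `sub2ind` is an acceptable way to convert the (row, column) pairs into linear indices. Because `r` and `c` are both permutations, every row and every column receives exactly one NaN.

```matlab
M = rand(100);                     % random 100x100 matrix, as above
r = randperm(100);                 % one random subject (row) index per NaN
c = randperm(100);                 % one random measurement (column) index per NaN
M(sub2ind(size(M), r, c)) = NaN;   % set all 100 cells (r(i), c(i)) in one step
nNaN = nnz(isnan(M))               % number of NaN cells that were inserted
```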

Now I delete the measurements that are missing for all subjects (if any):

idx_col_all_NAN = find(all(isnan(M)==1));   
M(:,idx_col_all_NAN)=[];

And I delete the subjects that are missing all measurements (if any):

idx_row_all_NAN = find(all(isnan(M)==1,2));   
M(idx_row_all_NAN,:)=[];

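Now I want to delete measurements so as to maximize the number of subjects that share the same measurements, and to minimize the number of cells of M that contain NaN.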
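To make the two quantities concrete, here is a minimal sketch on a small hand-made matrix. It assumes that "subjects that share the same measurements" means rows with no NaN in the columns that are kept; the variable names are illustrative only.

```matlab
M = [1 NaN 3;
     4 5   6;
     7 8   NaN;
     1 2   3];
nComplete = sum(~any(isnan(M), 2))     % subjects with every measurement: 2
nNaN      = nnz(isnan(M))              % cells containing NaN: 2

M2 = M(:, [1 3]);                      % drop measurement 2
nComplete2 = sum(~any(isnan(M2), 2))   % subjects 1, 2 and 4 are now complete: 3
```

Dropping one measurement raises the number of complete subjects from 2 to 3, at the cost of losing that column's data. That is the trade-off being asked about.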

Can you help me?

1 Answer:

Answer 0: (score: 0)

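To keep removing NaNs from your matrix, you need to formulate some rule for the trade-off between the amount of data you keep and the number of NaNs. As you said, if you keep deleting NaNs without any restriction, you may be left with very little data. There is no single right rule for this; it really depends on your requirements, and the suggestion below is only meant to give you an idea of how to approach this kind of problem.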

So, as a starting point, I define an index of the "quality" of the matrix, in terms of how many "holes" it has:

M_ratio = sum(~isnan(M(:)))/numel(M); % fraction of the entries of M that are not NaN

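This index is larger when the matrix holds more data, and it equals 1 when there are no NaNs. We could keep deleting rows/columns for as long as we see an improvement, but because the matrix keeps shrinking, we will generally keep seeing an improvement while any NaN remains, so we would end up with an empty matrix (or a very small one, depending on how many NaNs there are).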
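A small worked example may help here. This is a sketch on a hand-made 3x3 matrix; the anonymous function `ratio` is a helper introduced for illustration and simply wraps the index defined above.

```matlab
ratio = @(A) sum(~isnan(A(:)))/numel(A);   % the "quality" index from above

M = [1   NaN 3;
     NaN NaN 6;
     7   8   9];
r0 = ratio(M)    % 6 of 9 entries are numbers: 6/9
M(2,:) = [];     % delete row 2, the row with the most NaNs
r1 = ratio(M)    % 5 of 6 entries are numbers: 5/6
M(:,2) = [];     % delete column 2, which holds the last NaN
r2 = ratio(M)    % no NaN left: 1
```

The index rises with each deletion, while the matrix shrinks from 9 entries to 4.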

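So we need to define a threshold for the improvement, so that if a deletion does not improve the matrix by at least a certain amount, we stop the process:

improve = 1-M_old_ratio/M_new_ratio % the relative improvement after deletion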

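improve is the relative gain in our "quality" index; if it is not large enough, we stop deleting rows/columns from the matrix. What is large enough? That is hard to say, so I will leave it to you to experiment and see what gives you a good result.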
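As a quick arithmetic check, suppose a deletion raises the index from 6/9 to 5/6. The numbers and the threshold value here are only illustrative.

```matlab
M_old_ratio = 6/9;                        % index before the deletion
M_new_ratio = 5/6;                        % index after the deletion
improve = 1 - M_old_ratio/M_new_ratio     % 1 - 0.8, i.e. about 0.2
threshold = 0.003;                        % illustrative stopping threshold
keepDeletion = improve > threshold        % true, so this deletion would be accepted
```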

So here is the complete code:

N = 100;
M = rand(N); % generate an NxN random matrix
M(randi(numel(M),N^2,1)) = nan;  % set NaN at N^2 random linear indices (drawn with replacement, so some repeat)
M(:,all(isnan(M),1))=[]; % delete all-NaN columns
M(all(isnan(M),2),:)=[]; % delete all-NaN rows
threshold = 0.003; % minimum relative improvement required to keep deleting
while 1
    M_ratio = sum(~isnan(M(:)))/numel(M); % fraction of non-NaN entries in M
    [mincol,indcol] = min(sum(~isnan(M),1)); % find the column with the most NaNs
    [minrow,indrow] = min(sum(~isnan(M),2)); % find the row with the most NaNs
    % 1 if that row holds no more values than that column, 2 otherwise
    % (named 'direction' to avoid shadowing the built-in dir function)
    [~,direction] = min([minrow;mincol]);
    Mtry = M;
    if direction == 1
        Mtry(indrow,:) = []; % delete row
    else
        Mtry(:,indcol) = []; % delete column
    end
    Mtry_ratio = sum(~isnan(Mtry(:)))/numel(Mtry); % get the new ratio
    improve = 1-M_ratio/Mtry_ratio; % the relative improvement after deletion
    if improve>threshold % if it improves by more than the threshold
        M = Mtry; % replace the matrix
    else
        break; % otherwise, stop
    end
end

If you only consider deleting columns and not rows, it is a bit simpler:

threshold = 0.002; % minimum relative improvement required to keep deleting
while 1
    M_ratio = sum(~isnan(M(:)))/numel(M); % fraction of non-NaN entries in M
    [~,indcol] = min(sum(~isnan(M),1)); % find the column with the most NaNs
    Mtry = M;
    Mtry(:,indcol) = []; % delete column
    Mtry_ratio = sum(~isnan(Mtry(:)))/numel(Mtry); % get the new ratio
    improve = 1-M_ratio/Mtry_ratio; % the relative improvement after deletion
    if improve>threshold % if it improves by more than the threshold
        M = Mtry; % replace the matrix
    else
        break; % otherwise, stop
    end
end
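With real data you will probably want to know which measurements survived. The sketch below runs the column-only loop on a small hand-made matrix and records the original column indices. The variable `keptCols` and the `size(M,2) > 1` guard are additions made for illustration, not part of the code above.

```matlab
M = [1   NaN 3   4;
     5   NaN 7   8;
     9   NaN 11  NaN;
     13  14  15  16];
keptCols = 1:size(M,2);   % original indices of the columns still in M
threshold = 0.002;        % minimum relative improvement required to keep deleting
while size(M,2) > 1       % guard: never delete the last column
    M_ratio = sum(~isnan(M(:)))/numel(M);           % current fraction of non-NaN entries
    [~,indcol] = min(sum(~isnan(M),1));             % column with the most NaNs
    Mtry = M;
    Mtry(:,indcol) = [];                            % try deleting that column
    Mtry_ratio = sum(~isnan(Mtry(:)))/numel(Mtry);  % fraction after the deletion
    improve = 1 - M_ratio/Mtry_ratio;               % relative improvement
    if improve > threshold
        M = Mtry;                                   % accept the deletion
        keptCols(indcol) = [];                      % and forget that column's index
    else
        break;                                      % improvement too small: stop
    end
end
keptCols   % [1 3]: measurements 2 and 4 were removed
```

Here column 2 is removed first (three NaNs), then column 4 (one NaN). After that no NaN remains, the improvement is 0, and the loop stops.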

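As you will notice, I introduced the NaNs into the matrix in a more compact way, but that does not matter since you have real data. I also used logical indexing, which is a more compact and efficient way to delete columns and rows.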
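For reference, here is a minimal sketch of the logical-indexing form on a small hand-made matrix. Passing the dimension to `all` explicitly is a defensive choice made here, because without it `all` works along the first non-singleton dimension and would behave differently on a single-row matrix.

```matlab
M = [1   NaN 3;
     NaN NaN NaN;
     7   NaN 9];
M(:, all(isnan(M), 1)) = [];   % the logical mask selects all-NaN columns; no find needed
M(all(isnan(M), 2), :) = [];   % same idea for all-NaN rows
M                              % [1 3; 7 9]
```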