Question

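I am trying to deal with a very large dataset. I have k = ~4200 matrices (of varying sizes) which must be compared combinatorially, skipping non-unique and self comparisons. Each of the k(k-1)/2 comparisons produces a matrix, which must be indexed against its parents (i.e. I must be able to find out where it came from). The convenient way to do this is to (triangularly) fill a k-by-k cell array with the result of each comparison. The results are ~100-by-~100 matrices on average. Using single-precision floats, this works out to roughly 400 GB overall. I need to 1) generate the cell array, or pieces of it, without trying to hold the whole thing in memory, and 2) access its elements (and their elements) in a similar fashion. My attempts so far have been inefficient because they rely on MATLAB's eval() as well as on save and clear being called inside loops.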

for i=1:k
    [~,m] = size(data{i});
    cur_var = ['H' int2str(i)];
    %# if i == 1; save('FileName'); end; %# If using a single MAT file and need to create it.
    eval([cur_var ' = cell(1,k-i);']);
    for j=i+1:k
        [~,n] = size(data{j});
        eval([cur_var '{i,j} = zeros(m,n,''single'');']);
        eval([cur_var '{i,j} = compare(data{i},data{j});']);
    end
    save(cur_var,cur_var); %# Add '-append' when using a single MAT file.
    clear(cur_var);
end

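The other thing I have tried is to perform the split whenever mod((i+j-1)/2,max(factor(k*(k-1)/2))) == 0. This divides the result into the largest number of same-size pieces, which seems logical. The indexing is a little more complicated, but not too bad, because a linear index can be used.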
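To make the linear-index idea concrete, here is a minimal sketch of one possible mapping between a pair (i,j) with i < j and a position in the list of k(k-1)/2 comparisons, plus the chunk that position would fall into. The helper name pair2lin, the small k, and chunkSize are illustrative assumptions, not part of my actual code.

```matlab
%# Sketch: map a pair (i,j), 1 <= i < j <= k, to a linear index and back.
k = 5;                                 %# small illustrative value, not the real k

%# rowStart(i) = number of pairs that precede row i when the upper triangle
%# is walked row by row, i.e. (i-1)*k - i*(i-1)/2
rowStart = cumsum([0, k-1:-1:1]);

%# forward map (hypothetical helper name)
pair2lin = @(i,j) rowStart(i) + (j - i);

%# inverse map: recover (i,j) from a linear index
lin = 8;
i = find(rowStart < lin, 1, 'last');
j = i + lin - rowStart(i);

%# which fixed-size chunk (e.g. which file) holds this comparison
chunkSize  = 5;                        %# assumed number of comparisons per chunk
chunkId    = ceil(lin / chunkSize);
posInChunk = lin - (chunkId - 1) * chunkSize;
```

With a mapping like this, each chunk can be stored as a flat cell array and a given pair can be located without keeping the k-by-k layout in memory.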

Does anyone know of, or see, a better way?

Answer 1

You can get rid of the eval and clear calls by assigning the file name separately.

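for i=1:k
    file_name = ['H' int2str(i)];
    %# one slot per partner j = i+1..k; slot j-i holds the comparison of i with j
    cur_var = cell(1, k-i);
    for j=i+1:k
        %# no need to preallocate with zeros: the result replaces the slot anyway
        cur_var{j-i} = compare(data{i}, data{j});
    end
    %# save expects the variable NAME as a string, not the variable itself
    save(file_name, 'cur_var');
end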

If you need the saved variables to have different names, use the -struct option of save.

str.(file_name) = cur_var;          %# dynamic field name, e.g. str.H3
save(file_name, '-struct', 'str');  %# each field of str is stored as its own variable
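Reading a single comparison back then only requires loading the file for its row. The following is a minimal sketch under an assumed layout: each file H<i>.mat holds a cell array named cur_var in which slot j-i is the comparison of i with j. The scratch directory, the small k, and the stand-in values used in place of compare are illustrative only.

```matlab
%# Sketch: look up comparison (i,j) from per-row MAT files.
outDir = tempname;  mkdir(outDir);     %# scratch directory for the demo
k = 4;                                 %# small illustrative value
for i = 1:k
    cur_var = cell(1, k-i);
    for j = i+1:k
        %# stand-in for compare(data{i}, data{j})
        cur_var{j-i} = single(10*i + j) * ones(2, 3, 'single');
    end
    save(fullfile(outDir, ['H' int2str(i) '.mat']), 'cur_var');
end

%# fetch comparison (2,4): load only the file for row 2, then index into it
i = 2;  j = 4;
s = load(fullfile(outDir, ['H' int2str(i) '.mat']), 'cur_var');
H_ij = s.cur_var{j-i};
```

This keeps at most one row of comparisons in memory at a time, at the cost of loading a whole row to read one element.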

Answer 2

Here is a version that combines speed with minimal memory use.

I use fwrite/fread so that you can still use parfor (and this time, I made sure it works :) )

%# assume data is loaded and k is known

%# find the index pairs for comparisons. This could be done more elegantly, I guess.
%# I'm constructing a lower triangular array, i.e. an array that has ones wherever
%# we want to compare i (row) and j (col). Then I use find to get i and j
[iIdx,jIdx] = find(tril(ones(k,k),-1));

%# create a directory to store the comparisons
mkdir('H_matrix_elements')
savePath = fullfile(pwd,'H_matrix_elements');

%# loop through all comparisons in parallel. This way there may be a bit more overhead from
%# the individual function calls. However, parfor is most efficient if there are 
%# a lot of relatively similarly fast iterations.
parfor ct = 1:length(iIdx)

   %# make the comparison - do double b/c there shouldn't be a memory issue 
   currentComparison = compare(data{iIdx(ct)},data{jIdx(ct)});

   %# create save-name as H_i_j, e.g. H_104_23
   saveName = fullfile(savePath,sprintf('H_%i_%i',iIdx(ct),jIdx(ct)));

   %# save. Since 'save' cannot be called directly inside parfor, use fwrite to write the data to disk
   fid = fopen(saveName,'w');

   %# for simplicity: save data as vector, add two elements to the beginning
   %# to store the size of the array. Specify 'double' explicitly: fwrite
   %# defaults to uint8, which would round and saturate the values
   fwrite(fid,[size(currentComparison)';currentComparison(:)],'double');

   %# close file
   fclose(fid);
end



%# to read e.g. comparison H_104_23
fid = fopen(fullfile(savePath,'H_104_23'),'r');
tmp = fread(fid,inf,'double'); %# same precision as used for writing
fclose(fid);

%# reshape into 2D array. Use a new name so the input cell array 'data' is not overwritten
comparison = reshape(tmp(3:end),tmp(1),tmp(2));
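If disk space matters more than precision, one variation is to write the size header and the data with explicit types, storing the values as single. This is a minimal round-trip sketch rather than a drop-in replacement: the matrix A, the scratch file name, and the uint32 header are assumptions made for illustration.

```matlab
%# Sketch: round-trip one matrix through a raw binary file with explicit precision.
A = single(magic(4)) / 3;                      %# stand-in for one comparison result
fname = [tempname '.bin'];                     %# scratch file for the demo

fid = fopen(fname, 'w');
fwrite(fid, size(A), 'uint32');                %# header: number of rows, number of columns
fwrite(fid, A, 'single');                      %# data, written column by column
fclose(fid);

fid = fopen(fname, 'r');
dims = fread(fid, 2, 'uint32');                %# read the two header values
B = fread(fid, prod(dims), 'single=>single');  %# read the data and keep it as single
fclose(fid);
B = reshape(B, dims');                         %# restore the original 2-D shape
delete(fname);                                 %# clean up the scratch file
```

The reader has to use the same types, in the same order, as the writer; a mismatch does not raise an error, it silently returns wrong numbers.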

Out-of-memory algorithms for addressing large arrays
