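I'm trying to process a very large dataset. I have k = ~4200 matrices (of varying sizes) that must be compared combinatorially, skipping non-unique and self comparisons. Each of the k(k-1)/2 comparisons produces a matrix, which must stay indexed to its parents (so that I can tell where it came from). The convenient way to do this is to fill a k-by-k cell array (triangularly) with the result of each comparison. The results are ~100 x ~100 matrices on average. Using single-precision floats, the whole thing comes to roughly 400 GB.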
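As a rough check of that figure, here is a back-of-the-envelope sketch. It assumes every result is exactly 100-by-100 and stored as single precision (4 bytes per element), which is a simplification of the "~100 x ~100 on average" stated above; under that assumption the total comes out near 350 GB, the same order of magnitude as the estimate in the question.

```matlab
% Rough storage estimate for all pairwise comparison results.
% Assumption: every result is exactly 100-by-100 single precision.
k = 4200;                        % number of input matrices
nPairs = k*(k-1)/2;              % unique comparisons, no self comparisons
bytesPerResult = 100*100*4;      % one 100-by-100 single matrix, 4 bytes per element
totalBytes = nPairs*bytesPerResult;
fprintf('%d comparisons, about %.0f GB\n', nPairs, totalBytes/1e9);
```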
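I need to 1) generate the cell array, or pieces of it, without trying to hold the whole thing in memory, and 2) access its elements (and their elements) in a similar fashion. My attempts so far have been inefficient because they rely on MATLAB's eval() and on calling save and clear inside loops.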
for i=1:k
    [~,m] = size(data{i});
    cur_var = ['H' int2str(i)];
    %# if i == 1; save('FileName'); end; %# If using a single MAT file and need to create it.
    eval([cur_var ' = cell(1,k-i);']);
    for j=i+1:k
        [~,n] = size(data{j});
        eval([cur_var '{i,j} = zeros(m,n,''single'');']);
        eval([cur_var '{i,j} = compare(data{i},data{j});']);
    end
    save(cur_var,cur_var); %# Add '-append' when using a single MAT file.
    clear(cur_var);
end
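Another thing I've tried is to split the output whenever mod((i+j-1)/2,max(factor(k*(k-1)/2))) == 0. This divides the results into the largest number of equally sized pieces, which seemed logical. The indexing becomes a little more complicated, but not too bad, because a linear index can be used.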
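To illustrate the linear-index idea, here is a minimal sketch that maps a pair (i,j) with i < j to its position in the row-by-row enumeration of the upper triangle, and from there to a chunk number. The names pairIndex, chunkSize and chunkOf are hypothetical, and the fixed chunk size is an assumption for illustration; it is not the exact factor-based split described above.

```matlab
% Map a pair (i,j), i < j, to a linear index over the upper triangle,
% counting row by row: (1,2),(1,3),...,(1,k),(2,3),...
k = 6;                                              % small k for illustration
pairIndex = @(i,j) (i-1)*k - i.*(i-1)/2 + (j-i);

% Check the formula against a plain enumeration of all pairs.
ct = 0;
ok = true;
for i = 1:k-1
    for j = i+1:k
        ct = ct + 1;
        ok = ok && (pairIndex(i,j) == ct);
    end
end

% Hypothetical chunking: every chunkSize consecutive pairs share one file.
chunkSize = 5;
chunkOf = @(idx) ceil(idx/chunkSize);
chunkOfLastPair = chunkOf(pairIndex(k-1,k));        % chunk that holds the last pair
```

With a mapping like this, the file that holds a given comparison can be found from (i,j) alone, without loading anything else.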
Does anyone know of, or see, a better approach?
Answer 0 (score: 1)
You can get rid of the eval and clear calls by assigning the file name separately.
for i=1:k
    file_name = ['H' int2str(i)];
    %# one cell per comparison partner j = i+1..k
    cur_var = cell(1, k-i);
    for j=i+1:k
        %# store at position j-i so the cell array keeps its preallocated size
        cur_var{j-i} = compare(data{i}, data{j});
    end
    %# save expects the variable name as a string, not the variable itself
    save(file_name, 'cur_var');
end
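If you need the saved variable to have a different name in each file, save it with the -struct option.
str = struct();
str.(file_name) = cur_var; %# the field name becomes the variable name in the MAT file
save(file_name, '-struct', 'str');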
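For completeness, a small round-trip sketch of the -struct approach, including reading a result back with load. The toy data in rowResults and the use of tempdir are assumptions made only so the example runs on its own; they are not part of the original answer.

```matlab
% Toy stand-in for one row of results (hypothetical data, not real comparisons).
rowResults = {single(magic(3)), single(ones(2,4))};
file_name = 'H1';                          % also becomes the variable name in the file
matPath = fullfile(tempdir, [file_name '.mat']);

str = struct();
str.(file_name) = rowResults;              % dynamic field name -> saved variable name
save(matPath, '-struct', 'str');

loaded = load(matPath);                    % struct with a single field, H1
secondResult = loaded.(file_name){2};      % the 2-by-4 single matrix
delete(matPath);                           % clean up the temporary file
```

Note that load reads the whole row into memory, so the amount of data per file determines the smallest unit you can access.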
Answer 1 (score: 1)
Here is a version that combines speed with minimal memory use.
I use fwrite / fread so that you can still use parfor (and this time, I made sure it works :))
%# assume data is loaded and k is known
%# find the index pairs for comparisons. This could be done more elegantly, I guess.
%# I'm constructing a lower triangular array, i.e. an array that has ones wherever
%# we want to compare i (row) and j (col). Then I use find to get i and j
[iIdx,jIdx] = find(tril(ones(k,k),-1));
%# create a directory to store the comparisons
mkdir('H_matrix_elements')
savePath = fullfile(pwd,'H_matrix_elements');
%# loop through all comparisons in parallel. This way there may be a bit more overhead from
%# the individual function calls. However, parfor is most efficient if there are
%# a lot of relatively similarly fast iterations.
parfor ct = 1:length(iIdx)
    %# make the comparison - do double b/c there shouldn't be a memory issue
    currentComparison = compare(data{iIdx(ct)},data{jIdx(ct)});
    %# create save-name as H_i_j, e.g. H_104_23
    saveName = fullfile(savePath,sprintf('H_%i_%i',iIdx(ct),jIdx(ct)));
    %# save. Since 'save' is not allowed, use fwrite to write the data to disk
    fid = fopen(saveName,'w');
    %# for simplicity: save data as vector, add two elements to the beginning
    %# to store the size of the array. Pass 'double' explicitly: fwrite defaults
    %# to uint8, which would round and saturate the values
    fwrite(fid,[size(currentComparison)';double(currentComparison(:))],'double');
    %# close file
    fclose(fid);
end
%# to read e.g. comparison H_104_23
fid = fopen(fullfile(savePath,'H_104_23'),'r');
%# read with the same precision that was used for writing
tmp = fread(fid,'double');
fclose(fid);
%# reshape into 2D array (use a new name so the input cell array data is not overwritten).
H_104_23 = reshape(tmp(3:end),tmp(1),tmp(2));
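Since the question asks for single precision, one possible variant is to write the size header and the data with explicit precisions, which would roughly halve the disk usage compared with doubles. This is only a sketch under assumptions: the matrix A is toy data, the file name H_2_1 and the use of tempdir are hypothetical, and the results are assumed to be 2D.

```matlab
% Round trip of one result matrix through a raw binary file in single precision.
A = single(rand(3,5));                     % toy stand-in for one comparison result
binPath = fullfile(tempdir, 'H_2_1');      % hypothetical name in the H_i_j scheme

fid = fopen(binPath, 'w');
fwrite(fid, size(A), 'uint32');            % header: number of rows and columns
fwrite(fid, A, 'single');                  % data in column-major order, 4 bytes each
fclose(fid);

fid = fopen(binPath, 'r');
sz = fread(fid, [1 2], 'uint32');          % read the two header values
B = fread(fid, sz, 'single=>single');      % read and reshape in one call
fclose(fid);
delete(binPath);                           % clean up the temporary file
```

The reader has to use the same precisions as the writer; the 'single=>single' form keeps the values as single in memory instead of converting them to double.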