I need to compute a 10^6 x 10^6 correlation matrix. To do this, I am using mapreduce,
following the instructions in the official documentation.
My main script:
ds=datastore('\data\','Type','Tall');
outds=mapreduce(ds,@covarianceMapper, @covarianceReducer);
results = readall(outds);
CorrelationMatrix = results.Value{1};
Mapper function:
function covarianceMapper(t,~,intermKVStore)
    x = t;
    n = size(x,1);
    m = mean(x,1);
    c = cov(x,1);
    add(intermKVStore,'key',{n m c})
end
Reducer function:
function covarianceReducer(~,intermValIter,outKVStore)
    n1 = 0; % no rows so far
    m1 = 0; % mean so far
    c1 = 0; % covariance so far
    while hasnext(intermValIter)
        % Get the next chunk, and extract the count, mean, and covariance
        t = getnext(intermValIter);
        n2 = t{1};
        m2 = t{2};
        c2 = t{3};
        % Use weighting formulas to update the values so far
        n = n1+n2; % new count
        m = (n1*m1 + n2*m2) / n; % new mean
        % New covariance is a weighted combination of the two covariances, plus
        % additional terms that relate to the difference in means
        c1 = (n1*c1 + n2*c2 + n1*(m1-m)'*(m1-m) + n2*(m2-m)'*(m2-m)) / n;
        % Store the new mean and count for the next iteration
        m1 = m;
        n1 = n;
    end
    % Save results in the output key/value store
    s = sqrt(diag(c1));
    add(outKVStore,'correlation',c1./(s*s'));
end
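For reference, this is how I read the update inside the while loop, written out as formulas. It assumes the means are row vectors and the covariances are normalized by the row count, as cov(x,1) returns them. The symbols n_1, m_1, C_1 (running values) and n_2, m_2, C_2 (next chunk) simply mirror the variable names in the code and are my own notation, not taken from the documentation.

```latex
\begin{align}
n &= n_1 + n_2, \\
m &= \frac{n_1 m_1 + n_2 m_2}{n}, \\
C &= \frac{n_1 C_1 + n_2 C_2
      + n_1 (m_1 - m)^{\top}(m_1 - m)
      + n_2 (m_2 - m)^{\top}(m_2 - m)}{n}, \\
R_{ij} &= \frac{C_{ij}}{s_i s_j}, \qquad s_i = \sqrt{C_{ii}} .
\end{align}
```

With p columns, every C here is a p x p matrix, so the reducer holds at least one full p x p array at a time.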
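The problem is that ds has 10^6 columns and 10^3 rows. The resulting correlation matrix is huge (10^6 x 10^6, which is 10^12 elements, or about 8 TB in double precision), and I run into memory problems in the reducer step. Obviously, such a huge result does not fit into my memory, and I get an error. I know I should probably be working with tall arrays, but simply calling tall(mapreduce()) did not help.
Since mapreduce() is designed for big data, there should be a solution for the case where the output is also large. Can you help me find such a solution?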