MapReduce result does not fit in memory

Date: 2017-11-24 12:43:34

Tags: matlab mapreduce bigdata

I need to compute a 10^6 x 10^6 correlation matrix. To do that, I used mapreduce, following the instructions in the official documentation.

My main script:

ds = datastore('\data\', 'Type', 'Tall');                     % tall datastore over the data files
outds = mapreduce(ds, @covarianceMapper, @covarianceReducer); % output is a key/value datastore
results = readall(outds);                                     % read the entire output into memory
CorrelationMatrix = results.Value{1};

The mapper function:

function covarianceMapper(t,~,intermKVStore)
    x = t;                             % data chunk handed in by mapreduce
    n = size(x,1);                     % number of rows in this chunk
    m = mean(x,1);                     % column means of this chunk
    c = cov(x,1);                      % chunk covariance, normalized by n
    add(intermKVStore,'key',{n m c})   % emit the chunk statistics under a single key
end

The reducer function:

function covarianceReducer(~,intermValIter,outKVStore)
n1 = 0; % no rows so far
m1 = 0; % mean so far
c1 = 0; % covariance so far

while hasnext(intermValIter)
    % Get the next chunk, and extract the count, mean, and covariance
    t = getnext(intermValIter);
    n2 = t{1};
    m2 = t{2};
    c2 = t{3};

    % Use weighting formulas to update the values so far
    n = n1+n2;                     % new count
    m = (n1*m1 + n2*m2) / n;       % new mean

    % New covariance is a weighted combination of the two covariances, plus
    % additional terms that relate to the difference in means
    c1 = (n1*c1 + n2*c2 + n1*(m1-m)'*(m1-m) + n2*(m2-m)'*(m2-m))/ n;

    % Store the new mean and count for the next iteration
    m1 = m;
    n1 = n;
end

% Convert the pooled covariance to a correlation matrix and save it
% in the output key/value store
s = sqrt(diag(c1));
add(outKVStore,'correlation',c1./(s*s'));
end
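
For reference, the merging formulas used in the reducer can be sanity-checked on a small in-memory example. The sketch below is only illustrative (the data and variable names are made up, not from the post): merging per-chunk counts, means, and covariances with the same weighting should reproduce the correlation matrix of the full data.

% In-memory sanity check of the chunk-merging formulas (illustrative only)
rng(0);
X = randn(1000, 5);                       % small example: 1000 rows, 5 columns
chunks = {X(1:400,:), X(401:1000,:)};     % pretend these are two mapper chunks

n1 = 0; m1 = 0; c1 = 0;
for k = 1:numel(chunks)
    x  = chunks{k};
    n2 = size(x,1);
    m2 = mean(x,1);
    c2 = cov(x,1);

    n  = n1 + n2;
    m  = (n1*m1 + n2*m2) / n;
    c1 = (n1*c1 + n2*c2 + n1*(m1-m)'*(m1-m) + n2*(m2-m)'*(m2-m)) / n;
    m1 = m;
    n1 = n;
end

s = sqrt(diag(c1));
C = c1 ./ (s*s');
max(max(abs(C - corrcoef(X))))            % essentially zero (round-off only)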

The problem is that ds has 10^6 columns and 10^3 rows. The resulting correlation matrix is huge (10^6 x 10^6), and I run into memory problems at the reducer step. Apparently a result this large does not fit in my memory, and I get an error. I know I should be working with tall arrays, but a simple tall(mapreduce()) did not help.
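
(For context, the attempt mentioned above presumably looked something like the sketch below; the exact call is an assumption, since the post only says "a simple tall(mapreduce())". Wrapping the output datastore in tall() does not remove the bottleneck, because the reducer stores the entire 10^6 x 10^6 matrix as a single value under one key, so using the result still requires materializing that one value in memory.)

% Assumed form of the attempt described above (not from the post)
outds = mapreduce(ds, @covarianceMapper, @covarianceReducer);
tresults = tall(outds);        % tall table of Key/Value pairs
% gather(tresults)             % gathering still pulls the whole matrix into memory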

Since mapreduce() is designed for big data, there should be a solution for the case where the output is also large. Can you help me find such a solution?

0 Answers:

No answers yet.