How can I normalize each feature using Hadoop Pig?

Asked: 2017-08-09 08:53:11

Tags: hadoop apache-pig

Right now I have data like this:

data_new = 
    foreach data
    generate
        (float)$0,
        (float)$1,
        ......;

If data_new has only 2 columns, I know each column can be normalized (z-score) like this:

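-- Pass 1: attach the per-column mean and the row count to every row.
grp_data = group data_new all;
tmp = foreach grp_data {
    -- Inside the group, columns must be projected from the data_new bag.
    sum0 = SUM(data_new.$0);
    sum1 = SUM(data_new.$1);
    count = COUNT(data_new);
    generate
        flatten(data_new),
        sum0/count as mean0,
        sum1/count as mean1,
        count as count;
};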

-- Squared deviation from the mean, per row and per column.
tmp = foreach tmp {
    diff0 = ($0 - mean0) * ($0 - mean0);
    diff1 = ($1 - mean1) * ($1 - mean1);
    generate *, diff0 as diff0, diff1 as diff1;
};

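-- Pass 2: sum the squared deviations per column.
-- The sums get their own names (sq_sum0, sq_sum1) so they cannot be
-- confused with the per-row diff0/diff1 fields carried along by flatten.
grp_data = group tmp all;
tmp = 
    foreach grp_data 
    generate
        flatten(tmp),
        SUM(tmp.diff0) as sq_sum0,
        SUM(tmp.diff1) as sq_sum1;

-- z-score: (value - mean) / population standard deviation.
tmp =
    foreach tmp
    generate 
        ($0 - mean0) / (SQRT(sq_sum0/count)),
        ($1 - mean1) / (SQRT(sq_sum1/count));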
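
For reference, the script above is meant to compute the usual z-score for each column. Writing x_{ij} for the value in row i and column j, and n for the row count (the `count` field in the script), the quantities are as follows. Note that it divides by n, so this is the population standard deviation, not the sample one.

```latex
\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij},
\qquad
\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{ij}-\mu_j\right)^2},
\qquad
z_{ij} = \frac{x_{ij}-\mu_j}{\sigma_j}
```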

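When data_new has too many columns, writing this out by hand for every column is impractical. Is there a way to get the same result without repeating the code per column?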
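
One possible workaround, offered only as a sketch, is to generate the repetitive Pig Latin from a small script instead of typing it per column. The Python below builds the same two-pass computation as in the question for any number of columns. The function name `build_pig_script` and the relation and field names (`with_means`, `with_diffs`, `with_sums`, `normalized`, `n`, `sq_sumN`, `zN`) are made up for illustration, and the generated Pig has not been run against a Pig installation here.

```python
def build_pig_script(num_cols, relation="data_new"):
    """Return Pig Latin text that z-score normalizes columns $0..$(num_cols-1).

    Hypothetical helper: it only builds a string and never talks to Pig.
    """
    if num_cols < 1:
        raise ValueError("num_cols must be at least 1")

    def block(template):
        # Repeat one expression template for every column index.
        return ",\n".join(
            "        " + template.format(i=i, rel=relation)
            for i in range(num_cols)
        )

    means = block("SUM({rel}.${i}) / COUNT({rel}) as mean{i}")
    diffs = block("(${i} - mean{i}) * (${i} - mean{i}) as diff{i}")
    sq_sums = block("SUM(with_diffs.diff{i}) as sq_sum{i}")
    zscores = block("(${i} - mean{i}) / SQRT(sq_sum{i} / n) as z{i}")

    lines = [
        "-- Pass 1: per-column means and the row count.",
        "grp = group {rel} all;".format(rel=relation),
        "with_means = foreach grp generate",
        "        flatten({rel}),".format(rel=relation),
        means + ",",
        "        COUNT({rel}) as n;".format(rel=relation),
        "",
        "-- Squared deviation of every value from its column mean.",
        "with_diffs = foreach with_means generate",
        "        *,",
        diffs + ";",
        "",
        "-- Pass 2: per-column sums of squared deviations.",
        "grp2 = group with_diffs all;",
        "with_sums = foreach grp2 generate",
        "        flatten(with_diffs),",
        sq_sums + ";",
        "",
        "-- z-score using the population standard deviation.",
        "normalized = foreach with_sums generate",
        zscores + ";",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the script for a 3-column relation.
    print(build_pig_script(3))
```

The output can be saved to a file and run as an ordinary Pig script after the statement that defines data_new. Like the original, it relies on positional references ($0, $1, ...), so it assumes the columns to normalize are the first `num_cols` fields of the relation.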

0 answers
