Right now I have data like this:
data_new = foreach data generate
    (float)$0,
    (float)$1,
    ......;
If data_new has only 2 columns, I know we can normalize each column as follows:
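grp_data = group data_new all;
tmp = foreach grp_data {
    -- after "group ... all" the rows live in the bag data_new, so project the columns from it
    sum0 = SUM(data_new.$0);
    sum1 = SUM(data_new.$1);
    count = COUNT(data_new);
    generate
        flatten(data_new),
        sum0/count as mean0,
        sum1/count as mean1,
        count as count;
};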
tmp = foreach tmp {
    diff0 = ($0 - mean0) * ($0 - mean0);
    diff1 = ($1 - mean1) * ($1 - mean1);
    generate *, diff0 as diff0, diff1 as diff1;
};
grp_data = group tmp all;
tmp = foreach grp_data generate
    flatten(tmp),
    SUM(tmp.diff0) as diff0,
    SUM(tmp.diff1) as diff1;
tmp = foreach tmp generate
    ($0 - mean0) / (SQRT(diff0/count)),
    ($1 - mean1) / (SQRT(diff1/count));
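To be explicit about what the script above is meant to compute: each column gets a standard z-score, using the population standard deviation (dividing by the row count, not by the count minus one). The symbols below are only notation introduced for this post: N corresponds to count, x_{ij} is the value in row i of column j, mu_j corresponds to mean0/mean1, and sigma_j corresponds to SQRT(diff0/count) and SQRT(diff1/count).

```latex
\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad
\sigma_j = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(x_{ij}-\mu_j\right)^2}, \qquad
z_{ij} = \frac{x_{ij}-\mu_j}{\sigma_j}
```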
However, when data_new has too many columns, the approach above is no longer usable. Is there any way to achieve the same result?