I have a huge text file.
The data is stored in a directory as data/data1.txt, data2.txt and so on,
in the form merchant_id, user_id, amount
1234, 9123, 299.2
1233, 9199, 203.2
1234, 0124, 230
and so on..
What I want to do is find the average amount for each merchant.
So basically, in the end I want to save the output to a file,
something like merchant_id, average_amount
1234, avg_amt_1234 a
and so on.
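Also, how do I calculate the standard deviation?
Sorry for asking such a basic question. :( Any help would be appreciated. :)
Answer 0 (score: 13)
Apache Pig is well suited to this kind of task. See the example: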
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray,c2:chararray);
grp = group inpt by id;
mean = foreach grp {
sum = SUM(inpt.amnt);
count = COUNT(inpt);
generate group as id, sum/count as mean, sum as sum, count as count;
};
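Pay special attention to the data type of the amnt column, because it determines which implementation of the SUM function Pig will invoke.
Pig can also do something that a plain SQL GROUP BY cannot: it can attach the mean to every input row without using any inner join. That is useful if you are calculating z-scores from the standard deviation.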
mean = foreach grp {
sum = SUM(inpt.amnt);
count = COUNT(inpt);
generate FLATTEN(inpt), sum/count as mean, sum as sum, count as count;
};
FLATTEN(inpt) does the trick: you now have access to the original amounts that contributed to the group mean, sum and count.
Update 1:
Calculating variance and standard deviation:
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
mean = foreach grp {
sum = SUM(inpt.amnt);
count = COUNT(inpt);
generate flatten(inpt), sum/count as avg, count as count;
};
tmp = foreach mean {
dif = (amnt - avg) * (amnt - avg) ;
generate *, dif as dif;
};
grp = group tmp by id;
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif) as sqr_sum;
standard = foreach standard_tmp generate *, sqr_sum / count as variance, SQRT(sqr_sum / count) as standard;
This will use 2 jobs. I have not yet figured out how to do it in one; I need to spend more time on it.
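The two stages of the script above can be mirrored outside Pig to check the arithmetic. The following is only a minimal Python sketch, not the Pig execution plan: the function name group_stats_two_pass and the in-memory sample rows are assumptions made for illustration, with the amounts taken from the question's sample data. Like the script, it divides by the count, so it yields the population variance.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical in-memory sample: (merchant_id, amount) pairs from the question.
rows = [("1234", 299.2), ("1233", 203.2), ("1234", 230.0)]

def group_stats_two_pass(rows):
    """Return {key: (mean, variance, stddev)} using two passes over the rows."""
    # Pass 1: per-group sum and count, which give the mean of each group.
    totals = defaultdict(lambda: [0.0, 0])
    for key, amount in rows:
        totals[key][0] += amount
        totals[key][1] += 1
    means = {key: s / n for key, (s, n) in totals.items()}

    # Pass 2: each row looks up its own group's mean and contributes
    # its squared difference, like the dif column in the script.
    sqr_sums = defaultdict(float)
    for key, amount in rows:
        sqr_sums[key] += (amount - means[key]) ** 2

    stats = {}
    for key, (_, n) in totals.items():
        variance = sqr_sums[key] / n  # population variance (divide by count)
        stats[key] = (means[key], variance, sqrt(variance))
    return stats

stats = group_stats_two_pass(rows)
```

The second pass depends on the means produced by the first, which is why this formulation needs two jobs when run as MapReduce.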
Answer 1 (score: 1)
So what exactly do you want? Running Java code, or the abstract map-reduce process? For the second:
Map step:
record -> (merchant_id as key, amount as value)
Reduce step:
(merchant_id, amount) -> (merchant_id, aggregate the value you want)
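In the reduce step you receive a stream of records that share the same key, and you can do almost anything with them, including computing the mean and the variance.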
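The map and reduce steps above can be sketched in a few lines of Python. This is a local simulation and not Hadoop code: map_record, reduce_merchant and run_job are hypothetical names, the shuffle is imitated with a sort followed by groupby, and the input lines are the sample rows from the question.

```python
from itertools import groupby
from operator import itemgetter
from statistics import mean, pstdev

def map_record(line):
    """Map step: 'merchant_id, user_id, amount' -> (merchant_id, amount)."""
    merchant_id, _user_id, amount = (field.strip() for field in line.split(","))
    return merchant_id, float(amount)

def reduce_merchant(merchant_id, amounts):
    """Reduce step: all amounts for one key -> (key, mean, population stddev)."""
    values = list(amounts)
    return merchant_id, mean(values), pstdev(values)

def run_job(lines):
    # Sorting by key stands in for the shuffle phase, so that groupby
    # sees all records of one merchant next to each other.
    mapped = sorted((map_record(line) for line in lines), key=itemgetter(0))
    return [
        reduce_merchant(key, (amount for _, amount in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    ]

lines = ["1234, 9123, 299.2", "1233, 9199, 203.2", "1234, 0124, 230"]
results = run_job(lines)
```

Each tuple in results corresponds to one output line of the form merchant_id, average_amount, with the standard deviation as an extra column.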
Answer 2 (score: 1)
You can calculate the standard deviation in a single step, using the formula
var(x) = E[x^2] - (E[x])^2, stddev(x) = sqrt(var(x))
That's it!
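Assuming the formula meant here is the identity variance = E[x^2] - (E[x])^2 (the same one behind the SQRT(SUM(CSQ.CC) / CNT - MEAN * MEAN) expression in the last answer), a single pass only has to carry three numbers per key: the count, the sum and the sum of squares. The Python sketch below illustrates that idea; the names partial, merge and finalize are hypothetical and not part of any Hadoop or Pig API.

```python
from math import sqrt

def partial(amounts):
    """Summarise a chunk of amounts as (count, sum, sum of squares) in one pass."""
    n, total, total_sq = 0, 0.0, 0.0
    for x in amounts:
        n += 1
        total += x
        total_sq += x * x
    return n, total, total_sq

def merge(a, b):
    """Combine two partial summaries; this is what a combiner or reducer would do."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

def finalize(summary):
    """Turn (count, sum, sum of squares) into (mean, population stddev)."""
    n, total, total_sq = summary
    mean = total / n
    # Clamp at zero: rounding can push the difference slightly negative.
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, sqrt(variance)

# Two chunks of merchant 1234's amounts, as if seen by different mappers.
summary = merge(partial([299.2]), partial([230.0]))
mean_1234, std_1234 = finalize(summary)
```

Because the three accumulators simply add up, partial results from different chunks can be merged in any order, which is what makes a single job sufficient. The trade-off is numerical: when the mean is large compared with the spread, subtracting two nearly equal numbers loses precision, so the two-pass form is safer for such data.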
Answer 3 (score: 0)
I computed all the statistics (minimum, maximum, mean and standard deviation) in a single loop. FILTER_DATA contains the dataset.
GROUP_SYMBOL_YEAR = GROUP FILTER_DATA BY (SYMBOL, SUBSTRING(TIMESTAMP,0,4));
STATS_ALL = FOREACH GROUP_SYMBOL_YEAR {
MINIMUM = MIN(FILTER_DATA.CLOSE);
MAXIMUM = MAX(FILTER_DATA.CLOSE);
MEAN = AVG(FILTER_DATA.CLOSE);
CNT = COUNT(FILTER_DATA.CLOSE);
CSQ = FOREACH FILTER_DATA GENERATE CLOSE * CLOSE AS (CC:DOUBLE);
GENERATE group.$0 AS (SYMBOL:CHARARRAY), MINIMUM AS (MIN:DOUBLE), MAXIMUM AS (MAX:DOUBLE), ROUND_TO(MEAN,6) AS (MEAN:DOUBLE), ROUND_TO(SQRT(SUM(CSQ.CC) / (CNT * 1.0) - (MEAN * MEAN)),6) AS (STDDEV:DOUBLE), group.$1 AS (YEAR:INT);
};