Finding the mean using Pig or Hadoop

Asked: 2012-09-26 01:56:24

Tags: hadoop apache-pig

I have a huge text file.

The data is stored in a directory as data/data1.txt, data2.txt, and so on, in this format:

merchant_id, user_id, amount
1234, 9123, 299.2
1233, 9199, 203.2
1234, 0124, 230
and so on..

What I want to do is, for each merchant, find the average amount.

So basically, in the end, I want to save the output to a file.

Something like this:
 merchant_id, average_amount
  1234, avg_amt_1234
  and so on.

Also, how do I calculate the standard deviation?

Sorry for asking such a basic question. :( Any help would be appreciated. :)

4 answers:

Answer 0 (score: 13):

Apache Pig is very well suited for this kind of task. See the example:

inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray,c2:chararray);
grp = group inpt by id;
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate group as id, sum/count as mean, sum as sum, count as count;
};

Pay close attention to the data type of the amnt column, as it determines which implementation of the SUM function Pig will call.

Pig can also do something SQL cannot: it can attach the mean to every input row without using any inner joins. That is useful if you are calculating z-scores with the standard deviation.

 mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate FLATTEN(inpt), sum/count as mean, sum as sum, count as count;
};

FLATTEN(inpt) does the trick: you now have access to the original amounts that contributed to each group's mean, sum, and count.

Update 1:

Calculating variance and standard deviation

inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
mean = foreach grp {
        sum = SUM(inpt.amnt);
        count = COUNT(inpt);
        generate flatten(inpt), sum/count as avg, count as count;
};
tmp = foreach mean {
    dif = (amnt - avg) * (amnt - avg) ;
     generate *, dif as dif;
};
grp = group tmp by id;
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif) as sqr_sum; 
standard = foreach standard_tmp generate *, sqr_sum / count as variance, SQRT(sqr_sum / count) as standard;

This will use 2 jobs. I have not figured out how to do it in one yet; hmm, need to spend more time on it.

Answer 1 (score: 1):

So which do you want: to write the Java code yourself, or to reason about the abstract map-reduce process? For the second:

Map step:

record -> (merchant_id as key, amount as value)

Reduce step:

(merchant_id, amount) -> (merchant_id, aggregate the value you want)

In the reduce step you get a stream of records sharing the same key, and you can do almost anything with them, including computing the average and the variance.
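As a rough sketch of that flow using Hadoop Streaming (plain Python here; the file names mapper.py and reducer.py and the job invocation are illustrative, not part of the original answer):

#!/usr/bin/env python
# mapper.py: emit (merchant_id, amount) for every input record
import sys

for line in sys.stdin:
    line = line.strip()
    if not line or line.startswith('merchant_id'):
        continue                                  # skip blank lines and the header row
    fields = [f.strip() for f in line.split(',')]
    if len(fields) == 3:
        merchant_id, user_id, amount = fields
        print('%s\t%s' % (merchant_id, amount))   # key<TAB>value

#!/usr/bin/env python
# reducer.py: input arrives sorted by key, so each merchant's amounts are contiguous
import sys

current_id, total, count = None, 0.0, 0
for line in sys.stdin:
    merchant_id, amount = line.rstrip('\n').split('\t')
    if current_id is not None and merchant_id != current_id:
        print('%s, %s' % (current_id, total / count))   # finished one merchant
        total, count = 0.0, 0
    current_id = merchant_id
    total += float(amount)
    count += 1
if current_id is not None:
    print('%s, %s' % (current_id, total / count))       # last merchant

Run both scripts with the Hadoop Streaming jar (something like hadoop jar hadoop-streaming.jar -input data -output avg_out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py), and extend the reducer to also accumulate sum(x^2) if you want the variance in the same pass.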

Answer 2 (score: 1):

You can calculate the standard deviation in a single pass, using the formula

stddev = sqrt( sum(x^2)/n - (sum(x)/n)^2 )

That's it!
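As a quick illustration of that single-pass formula (plain Python rather than Pig, and not part of the original answer; the function name is made up), applied to merchant 1234's two amounts from the question:

import math

def one_pass_stats(amounts):
    # single pass: accumulate n, sum(x), and sum(x^2)
    n, s, sq = 0, 0.0, 0.0
    for x in amounts:
        n += 1
        s += x
        sq += x * x
    mean = s / n
    variance = sq / n - mean * mean          # E[X^2] - (E[X])^2
    return mean, math.sqrt(variance)

print(one_pass_stats([299.2, 230.0]))        # amounts for merchant 1234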

Answer 3 (score: 0):

I computed all of the statistics (minimum, maximum, mean, and standard deviation) with just one loop. FILTER_DATA contains the data set.

GROUP_SYMBOL_YEAR = GROUP FILTER_DATA BY (SYMBOL, SUBSTRING(TIMESTAMP,0,4));
STATS_ALL = FOREACH GROUP_SYMBOL_YEAR { 
    MINIMUM = MIN(FILTER_DATA.CLOSE);
    MAXIMUM = MAX(FILTER_DATA.CLOSE);
    MEAN = AVG(FILTER_DATA.CLOSE);
    CNT = COUNT(FILTER_DATA.CLOSE);
    CSQ = FOREACH FILTER_DATA GENERATE CLOSE * CLOSE AS (CC:DOUBLE);
    GENERATE group.$0 AS (SYMBOL:CHARARRAY), MINIMUM AS (MIN:DOUBLE), MAXIMUM AS (MAX:DOUBLE), ROUND_TO(MEAN,6) AS (MEAN:DOUBLE), ROUND_TO(SQRT(SUM(CSQ.CC) / (CNT * 1.0) - (MEAN * MEAN)),6) AS (STDDEV:DOUBLE), group.$1 AS (YEAR:INT);
};