Pig script takes data nodes offline

Date: 2012-05-25 18:23:51

Tags: hadoop mapreduce apache-pig

I'm running the following Pig script on a 12-node Hadoop cluster, where each node runs 30 map/reduce tasks with 2GB of memory per task:

A = LOAD '/path/to/gzipped/logs' USING PigStorage('\t');    -- tab-separated log lines
B = FOREACH A GENERATE $4 AS foo, $10 AS foo2, $15 AS foo3; -- project the three columns of interest
C = FILTER B BY foo == '10000';
C1 = FILTER C BY foo2 IS NOT NULL;
C2 = FILTER C1 BY foo3 IS NOT NULL;
D = GROUP C2 BY foo3;
E = FOREACH D {
    foo2s = $1.$1;                 -- $1 is the grouped bag C2; project its foo2 field
    unique_foo2s = DISTINCT foo2s; -- de-duplicate foo2 values within the group
    GENERATE group, COUNT(unique_foo2s) as foo2_count;
};
F = FILTER E BY $1 > 5;            -- keep groups with more than 5 distinct foo2 values
STORE F INTO 'bar.out' USING PigStorage();
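
For reference, the positional references in the nested block can be opaque: inside E, $1 refers to the grouped bag C2, so $1.$1 projects the foo2 field out of each tuple. Assuming the aliases above, an equivalent, more readable form of that step is:

E = FOREACH D {
    foo2s = C2.foo2;               -- same projection as $1.$1, by alias
    unique_foo2s = DISTINCT foo2s;
    GENERATE group, COUNT(unique_foo2s) AS foo2_count;
};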

The job runs fine when I load a single gzipped log file, but when I run it over the entire directory (about 500 gzipped logs at ~2.7GB each), it starts taking data nodes offline. What am I doing wrong?

(To be clear, what I'm trying to do is group by one field and then count the number of unique entries in another column for each group.)
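
One relevant detail, sketched here as a possible reformulation rather than a confirmed fix (the column aliases come from the script above, and the PARALLEL value of 60 is illustrative): gzip is not a splittable codec, so each ~2.7GB file must be decompressed by a single map task, and the nested DISTINCT forces each reducer to hold an entire group's bag of foo2 values in memory. The same per-group unique count can be computed by de-duplicating the (foo3, foo2) pairs first and then counting plain rows per group:

C3 = FOREACH C2 GENERATE foo3, foo2;  -- keep only the pair that matters
C4 = DISTINCT C3 PARALLEL 60;         -- de-duplicate pairs in a reduce phase
D2 = GROUP C4 BY foo3 PARALLEL 60;
E2 = FOREACH D2 GENERATE group, COUNT(C4) AS foo2_count;  -- row count per group = distinct foo2 count
F2 = FILTER E2 BY foo2_count > 5;
STORE F2 INTO 'bar.out' USING PigStorage();

This costs an extra MapReduce job but keeps per-reducer memory bounded.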

0 Answers:

No answers yet