Question

I have a very large input file in which every line has the following format:

id1 id2 id3 id4

The file is very large, about 10 million lines.

The Pig script I wrote is:

A = load '/input' using PigStorage(' ');
B = foreach A generate $2 as id3;
id_group = GROUP B BY id3;
count_id = FOREACH id_group GENERATE group, COUNT(B.id3);
Store count_id INTO 'statistic';
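
To make the intended result concrete, here is a minimal sketch in plain Python (not Pig) of what the script is meant to compute: the number of occurrences of each id3 value, i.e. the third space-separated field. The function name `count_third_field` and the sample lines are made up for illustration, and skipping short lines is an assumption of this sketch rather than a description of how Pig treats them.

```python
from collections import Counter


def count_third_field(lines):
    """Count how often each id3 value (third space-separated field) occurs."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        if len(fields) < 3:
            # Malformed line: skipped in this sketch (an assumption).
            continue
        counts[fields[2]] += 1
    return counts


# Hypothetical sample in the "id1 id2 id3 id4" layout.
sample = ["a b x d", "e f y h", "i j x l"]
result = count_third_field(sample)
print(sorted(result.items()))  # [('x', 2), ('y', 1)]
```

Running something like this on a small slice of the input gives a reference against which the Pig output for the same slice can be compared.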

When the file is small, the Pig script succeeds, but when I use the large input, it fails. It shows:

2013-10-10 23:25:01,655 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1% complete
2013-10-10 23:25:05,686 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 3% complete

2013-10-10 23:27:52,894 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201309291007_0201 has failed! Stop running all dependent jobs
2013-10-10 23:27:52,894 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-10-10 23:27:52,916 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2013-10-10 23:27:52,918 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
1.2.1   0.10.0  bhbz    2013-10-10 23:24:48     2013-10-10 23:27:52     GROUP_BY

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_201309291007_0201   A,B,count_id,id_group   GROUP_BY,COMBINER       Message: Job failed! Error - NA hdfs://h1061.mzhen.cn:9000/user/bhbz/statistic1,

Input(s):
Failed to read data from "/dataSet/public.mbm.3.0"

Output(s):
Failed to produce result in "hdfs://h1061.mzhen.cn:9000/user/bhbz/statistic1"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201309291007_0201


2013-10-10 23:27:52,918 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2013-10-10 23:27:52,932 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job failed, hadoop does not return any error message

Is it because GROUP uses too much memory?

How can I use Pig to count the number of occurrences of a field?

0 answers: