Pig - I get "Error: Java heap space" with hundreds of thousands of tuples

Date: 2013-08-30 09:18:22

Tags: java apache-pig

I have three sets of data, separated by type, and usually there are only a few hundred tuples per uid. But (probably due to some bug) a few uids have as many as 200,000-300,000 rows of data.

StuffProcessor sometimes throws a heap space error when there are too many tuples in a single bag. How can I solve this? Can I somehow check whether a single uid has more than 100,000 tuples, and then split the data into smaller batches?
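To find out whether any uid really has oversized groups, a quick diagnostic in Pig Latin itself can count tuples per uid before running the heavy UDF. This is a sketch assuming `stuff` is loaded as in the script below; the alias names `stuffCounts` and `bigUids` are made up for illustration:

```
-- Count tuples per uid and keep only the suspiciously large groups
stuffCounts = foreach (group stuff by (long)$0) generate group as uid, COUNT(stuff) as n;
bigUids = filter stuffCounts by n > 100000L;
dump bigUids;
```

Since COUNT is algebraic, this count runs with combiners and should not hit the same heap problem as the main job.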

I'm completely new to Pig and barely know what I'm doing.

-- Create union of the three stuffs 
stuff = UNION stuff1, stuff2, stuff3;

-- Group data by uid
stuffGrouped = group stuff by (long)$0;

-- Process data
processedStuff = foreach stuffGrouped generate StuffProcessor(stuff);

-- Flatten the UID groups into single table
flatProcessedStuff = foreach processedStuff generate FLATTEN($0);

-- Separate into different datasets by type, these are all schemaless
processedStuff1 = filter flatProcessedStuff by (int)$5 == 9;
processedStuff2 = filter flatProcessedStuff by (int)$5 == 17;
processedStuff3 = filter flatProcessedStuff by (int)$5 == 20;

-- Store everything into separate files into HDFS
store processedStuff1 into '$PROCESSING_DIR/stuff1.txt';
store processedStuff2 into '$PROCESSING_DIR/stuff2.txt';
store processedStuff3 into '$PROCESSING_DIR/stuff3.txt';

The Cloudera cluster should be allocating 4 GB of heap space.

This may actually be related to the Cloudera user, because I cannot reproduce the problem with certain users (the pig user and the hdfs user).

1 Answer:

Answer 0 (score: 1)

If your UDF doesn't really need to see all the tuples belonging to a key at once, you may want to implement the Accumulator interface so they can be processed in smaller batches. You could also consider implementing the Algebraic interface to speed things up.

The built-in COUNT is a good example.
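To make the shape of the Accumulator contract concrete, here is a minimal COUNT-style sketch. To keep it self-contained it uses stand-in types (`Accumulator`, `BatchedCount`, and the `List<?>` bag are stand-ins invented for this example); a real Pig UDF would instead extend `org.apache.pig.EvalFunc` and implement `org.apache.pig.Accumulator`, receiving `org.apache.pig.data.Tuple`/`DataBag` arguments:

```java
import java.util.List;

// Stand-in for org.apache.pig.Accumulator<T>: Pig feeds the UDF a group's
// tuples in several partial batches instead of one huge in-memory bag.
interface Accumulator<T> {
    void accumulate(List<?> partialBag); // called once per batch
    T getValue();                        // called after the last batch
    void cleanup();                      // reset state for the next key
}

// COUNT-style accumulator: a 300,000-row uid group is counted batch by
// batch, so the whole group never has to fit in the heap at once.
class BatchedCount implements Accumulator<Long> {
    private long count = 0;

    @Override
    public void accumulate(List<?> partialBag) {
        count += partialBag.size();
    }

    @Override
    public Long getValue() {
        return count;
    }

    @Override
    public void cleanup() {
        count = 0;
    }
}
```

When a UDF implements the real interface, Pig switches to this batched calling convention automatically for grouped input; the regular `exec` path remains as a fallback.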