This two-step Pig script works:
my_out = foreach (group my_in by id) {
    grouped = BagGroup(my_in.(keyword, weight), my_in.keyword);
    generate
        group as id,
        CountEach(my_in.domain) as domains,
        grouped as grouped;
};

my_out1 = foreach my_out {
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
    generate id, domains, keywords;
};
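For completeness, BagGroup and CountEach are DataFu UDFs, so the snippets here assume they are registered somewhere earlier in the script. A minimal sketch of that setup follows; the jar path and the (id, keyword, weight, domain) schema of my_in are assumptions inferred from the code, not part of the original post.

-- Assumed setup (not shown in the post): register DataFu and define the UDFs used above.
REGISTER /path/to/datafu.jar;                    -- hypothetical jar location
DEFINE CountEach datafu.pig.bags.CountEach();
DEFINE BagGroup  datafu.pig.bags.BagGroup();

-- Assumed input schema, inferred from the fields referenced in the script.
my_in = LOAD 'my_input' AS (id:chararray, keyword:chararray, weight:double, domain:chararray);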
However, when I combine the two statements into one:
my_out = foreach (foreach (group my_in by id) {
    grouped = BagGroup(my_in.(keyword, weight), my_in.keyword);
    generate
        group as id,
        CountEach(my_in.domain) as domains,
        grouped as grouped;
}) {
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
    generate id, domains, keywords;
};
I get this error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " <IDENTIFIER> "generate "" at line 1, column 5.
My question is: why does the combined version fail to parse, and does keeping the two statements separate change the resulting map-reduce plan?
Answer (score: 2):
In general, Pig's ability to parse complex nested expressions is unreliable. Another common error when the nesting gets too hairy is ERROR 1000: Error during parsing. Lexical error at line XXXX, column 0. Encountered: <EOF> after : ""
I often try to do this to avoid coming up with a bunch of names for aliases that have no meaning beyond being intermediate steps in a computation, but sometimes you just can't get away with it. My guess is that nesting one nested foreach inside another is not allowed. In your case, though, it looks like the first nested foreach isn't necessary. Try this:
my_out = foreach (foreach (group my_in by id)
    generate
        group as id,
        CountEach(my_in.domain) as domains,
        BagGroup(my_in.(keyword, weight), my_in.keyword) as grouped
) {
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
    generate id, domains, keywords;
};
As for your second question: no, this has no effect on the final MR plan. It is purely a matter of how Pig parses your script; grouping the commands this way does not change the map-reduce logic.
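If you want to verify that yourself, you can compare the execution plans Pig produces for the two versions with EXPLAIN; the aliases below are the ones used in this post.

-- The two-statement version ends in my_out1, the rewritten version in my_out;
-- per the answer above, both should compile to the same map-reduce plan.
EXPLAIN my_out1;
EXPLAIN my_out;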