Pig: How to access the fields of multiple tuples in a bag

Time: 2014-09-17 14:16:50

Tags: apache-pig

My Pig script:

A = LOAD 'average.txt' as line;  
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(line, '^(.*?)\\s+(.*?)\\s+(.*?)') AS TUPLE(AA:chararray,BB:chararray,CC:chararray);  
C = FILTER B BY tuple_0.AA IS NOT NULL;  
D = GROUP C BY $0.AA;  

Output after the GROUP statement:

(1,{((1,a,b)),((1,c,d))})  
(2,{((2,e,f)),((2,g,h))})

I need the final output to look like this:

(1,a,b,c,d)  
(2,e,f,g,h)

DESCRIBE output:

| D     | group:chararray     | C:bag{:tuple(tuple_0:tuple(AA:chararray,BB:chararray,CC:chararray))}  

1 answer:

Answer 0 (score: 0)

I suggest doing a self-join on C instead of grouping by $0.AA:

A = LOAD 'average.txt' as line;  
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(line, '^(.*?)\\s+(.*?)\\s+(.*?)') AS TUPLE(AA:chararray,BB:chararray,CC:chararray);  
C = FILTER B BY tuple_0.AA IS NOT NULL;  
C = FOREACH C GENERATE tuple_0.AA AS AA, tuple_0.BB AS BB, tuple_0.CC AS CC; --renaming columns to easy names

D = FOREACH C GENERATE AA, BB, CC;  -- clone of C

CD = JOIN C BY AA, D BY AA;
CD2 = FOREACH CD 
         GENERATE 
            C::AA AS AA, 
            C::BB AS CBB, 
            C::CC AS CCC, 
            D::BB AS DBB,
            D::CC AS DCC;
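
One caveat about the self-join: it pairs every row with every row that shares the same AA, including itself and both orderings of each pair. For the sample data, key 1 would produce four rows rather than the single (1,a,b,c,d) asked for. A minimal sketch of one way to trim this, assuming the CD2 relation from the script above and assuming BB values are distinct within each key (the alias CD3 is made up for illustration):

```pig
-- Keep only one ordering of each pair and drop rows joined with themselves.
-- Assumption: within one AA key, no two rows share the same BB value.
CD3 = FILTER CD2 BY CBB < DBB;
DUMP CD3;
-- For the sample data in the question this leaves:
-- (1,a,b,c,d)
-- (2,e,f,g,h)
```

This only gives one wide row per key when each key has exactly two rows. With three or more rows per key it yields one row per pair.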

I hope this helps.
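
If you would rather keep the GROUP from the question, another option may be the built-in BagToTuple (available in Pig 0.11 and later, as far as I know). This is a different technique from the self-join, shown only as a sketch. It assumes C has already been flattened to the plain fields AA, BB, CC, as in the renaming step above. The aliases G and H are made up for illustration:

```pig
-- Assumption: C has the top-level fields AA, BB, CC.
G = GROUP C BY AA;

-- C.(BB, CC) projects each tuple in the bag down to two fields.
-- BagToTuple un-nests the bag into a single tuple, and FLATTEN
-- promotes that tuple's fields to top-level columns.
H = FOREACH G GENERATE group AS AA, FLATTEN(BagToTuple(C.(BB, CC)));
DUMP H;
```

For the sample data this should give rows shaped like (1,a,b,c,d) and (2,e,f,g,h). Pig does not guarantee the order of tuples inside a bag, so the pairs may come out in a different order, and the number of columns varies with the number of rows per key.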