Storing a list of maps for the same key in Pig

Time: 2017-05-29 14:42:35

Tags: hadoop apache-pig

I have the following use case.

map1.csv

col0|col1
10|a1
20|b1

map2.csv

col1|col2|col3|col4
a1|aa|ab|ac
a1|ba|bb|bc
a1|ca|cb|cc
b1|mm|mn|mo
b1|xy|yz|xz

I need to join map1.csv with map2.csv on col1. When col1 matches, say on a1, I need to take the values of col2, col3 and col4 from every matching row and store them as a list of maps.

The map keys are hardcoded as col2, col3 and col4.
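
One way this per-row map could be built in Pig Latin, as a minimal sketch rather than a confirmed solution: the builtin `TOMAP` function takes alternating key and value arguments, so each hardcoded key can be paired with the row's value, and `GROUP` then collects the maps that share a col1 into one bag. The aliases `detail_rows`, `row_maps` and `maps_by_key` and the field name `m` are illustrative assumptions, and the sketch assumes the file contains no header line.

```pig
-- Load the detail rows (same layout as map2.csv, assumed to have no header line).
detail_rows = LOAD 'map2.csv' USING PigStorage('|')
              AS (col1: chararray, col2: chararray, col3: chararray, col4: chararray);

-- Pair each hardcoded key with the value from the current row.
row_maps = FOREACH detail_rows GENERATE
               col1,
               TOMAP('col2', col2, 'col3', col3, 'col4', col4) AS m;

-- Collect all maps that share the same col1 into one bag per key.
maps_by_key = GROUP row_maps BY col1;

-- Print the resulting schema to check the shape before going further.
DESCRIBE maps_by_key;
```

Note that `PigStorage` writes maps in Pig's own `[key#value]` notation and bags as `{(...)}`, so getting the JSON-style text shown in the expected output would likely need a further step, such as a different store function or a UDF.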

Expected output:

10|a1|[{"col2": "aa","col3": "ab","col4": "ac"},{"col2": "ba","col3": "bb","col4": "bc"},{"col2": "ca","col3": "cb","col4": "cc"}]
20|b1|[{"col2": "mm","col3": "mn","col4": "mo"},{"col2": "xy","col3": "yz","col4": "xz"}]

My script is as follows:

    input1= load 'map1.csv' using PigStorage('|') as (col0: int, col1: chararray);
    input2= load 'map2.csv' using PigStorage('|') as (col1: chararray, col2: chararray,col3: chararray, col4: chararray);
    input3 = GROUP input2 by col1;
    input4 = JOIN input1 by col1, input3 by col1;

The JOIN on the last line fails with this error:

    ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
    <line 4, column 40> Invalid field projection. Projected field [col1] does not exist in schema: group:chararray,input2:bag{:tuple(col1:chararray,col2:chararray,col3:chararray,col4:chararray)}.
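
The schema printed in the error message points at the likely cause: after `GROUP`, the key column of `input3` is named `group`, not `col1`, and the original rows sit in a bag named `input2`. A hedged sketch of the join using that field name follows. The final `FOREACH` and its alias `result` are illustrative additions that are not part of the original script.

```pig
input1 = LOAD 'map1.csv' USING PigStorage('|') AS (col0: int, col1: chararray);
input2 = LOAD 'map2.csv' USING PigStorage('|')
         AS (col1: chararray, col2: chararray, col3: chararray, col4: chararray);

-- GROUP yields two fields: "group" (the key) and "input2" (a bag of the rows).
input3 = GROUP input2 BY col1;

-- Join on the grouped relation's key field, which is called "group".
input4 = JOIN input1 BY col1, input3 BY group;

-- Illustrative projection: keep col0, col1 and the bag of matching rows.
result = FOREACH input4 GENERATE input1::col0, input1::col1, input3::input2;
```

This only addresses ERROR 1025. The bag in `result` still holds plain tuples, so turning each tuple into a map keyed by col2, col3 and col4 remains a separate step.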

0 answers:

No answers yet.