(Hortonworks Sandbox) Pig JOIN operation duplicates the primary key column

Time: 2018-04-09 20:14:46

Tags: join duplicates apache-pig hortonworks-sandbox

I want to join two tables. table1 has the columns id and value; table2 has the columns id and color.

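The table I get back has the columns id, value, id, color, but I want a table with only the columns id, value, color. How do I remove the duplicate id column from this table?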

2 Answers:

Answer 0: (score: 0)

If you run DESCRIBE final;, you will see that the schema looks like this:

final: {table1::id: chararray,table1::value: chararray,table2::id: chararray,table2::color: chararray}

To distinguish the two id columns, you can refer to them as table1::id and table2::id. So, to drop one of the duplicate columns, you can do the following:

A = FOREACH final GENERATE 
    table1::id AS id,
    table1::value AS value,
    table2::color AS color;

(I also renamed the fields to remove the table1:: and table2:: prefixes, since they are no longer needed.)

I could also have written:

A = FOREACH final GENERATE 
    table1::id AS id,
    value AS value,
    color AS color;

This does not produce an error, because value and color are unambiguous names: each appears in only one of the joined tables.
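As a further alternative (not part of the original answer), the same projection can be sketched with positional references instead of the table1:: and table2:: prefixes. This assumes final still has the four-column schema shown by DESCRIBE above, in that order; if the column order differs, the positions must be adjusted.

```pig
-- Assumption: final has the schema
--   $0 = table1::id, $1 = table1::value, $2 = table2::id, $3 = table2::color
-- $2 (the duplicate id from table2) is simply left out of the projection.
A = FOREACH final GENERATE
    $0 AS id,
    $1 AS value,
    $3 AS color;

-- Inspect the resulting schema to confirm that only three fields remain.
DESCRIBE A;
```

Positional references are shorter but break silently if the input schema changes, so the prefixed names are usually the safer choice.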

Answer 1: (score: 0)

Run the complete Pig script:

grunt> table1 = LOAD 'table1_input_path' USING PigStorage(',') as (id:int, value:int);
grunt> table2 = LOAD 'table2_input_path' USING PigStorage(',') as (id:int, color:chararray);
grunt> joinlevel = JOIN table1 BY id, table2 BY id;
grunt> final = FOREACH joinlevel generate table1::id as id, table1::value as value, table2::color as color;
grunt> dump final;
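
To check that the duplicate key column is actually gone, the schemas can be compared before and after the projection. This is a small sketch that assumes the relations joinlevel and final from the session above already exist.

```pig
-- Assumption: joinlevel and final were defined as in the session above.
-- The joined relation should list both table1::id and table2::id.
DESCRIBE joinlevel;

-- The projected relation should list a single id field plus value and color.
DESCRIBE final;
```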