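I'm trying to do a star-schema style join in Pig; my code is below. When I join multiple relations on different columns, I have to prefix each field with the names of all the earlier joins to make it work. I'm sure there must be a better way, but I couldn't find one by googling. Any pointers would be much appreciated.
That is, prefixing columns like "H864::H86::hs_8_d::hs_8_desc" is what I want to avoid.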
hs_8 = LOAD 'hs_8_distinct' USING PigStorage('^') as (hs_8:chararray,hs_8_desc:chararray);
hs_8_d = FOREACH hs_8 GENERATE SUBSTRING(hs_8,0,2) as hs_2,SUBSTRING(hs_8,0,4) as hs_4,SUBSTRING(hs_8,0,6) as hs_6,hs_8,hs_8_desc;
hs_6_d = LOAD 'hs_6_distinct' USING PigStorage('^') as (hs_6:chararray,hs_6_desc:chararray);
hs_4_d = LOAD 'hs_4_distinct' USING PigStorage('^') as (hs_4:chararray,hs_4_desc:chararray);
hs_2_d = LOAD 'hs_2_distinct' USING PigStorage('^') as (hs_2:chararray,hs_2_desc:chararray);
H86 = JOIN hs_8_d BY hs_6, hs_6_d BY hs_6 USING 'replicated' ;
H864 = JOIN H86 BY hs_8_d::hs_4, hs_4_d BY hs_4 USING 'replicated' ;
H8642 = JOIN H864 BY H86::hs_8_d::hs_2, hs_2_d BY hs_2 USING 'replicated' ;
hs_dim = FOREACH H8642 GENERATE hs_2_d::hs_2,hs_2_d::hs_2_desc,H864::hs_4_d::hs_4,H864::hs_4_d::hs_4_desc,H864::H86::hs_6_d::hs_6,H864::H86::hs_6_d::hs_6_desc,H864::H86::hs_8_d::hs_8,H864::H86::hs_8_d::hs_8_desc;
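Answer 0 (score: 2)
You can simplify the aliases a little by wrapping each join in an extra foreach. Checking the stats, this does not add any extra MR jobs to the pipeline: both the original script and this version produce 4 map-only jobs.
E.g.: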
H86 = foreach (JOIN hs_8_d BY hs_6, hs_6_d BY hs_6 USING 'replicated') generate
hs_8_d::hs_2 as x1,
hs_8_d::hs_4 as x2,
hs_8_d::hs_6 as x3,
hs_8_d::hs_8 as x4,
hs_8_d::hs_8_desc as x5,
hs_6_d::hs_6 as x6,
hs_6_d::hs_6_desc as x7;
H864 = foreach (JOIN H86 BY x2, hs_4_d BY hs_4 USING 'replicated') generate
H86::x1 as y1,
H86::x2 as y2,
H86::x3 as y3,
H86::x4 as y4,
H86::x5 as y5,
H86::x6 as y6,
H86::x7 as y7,
hs_4_d::hs_4 as y8,
hs_4_d::hs_4_desc as y9;
H8642 = foreach (JOIN H864 BY y1, hs_2_d BY hs_2 USING 'replicated') generate
H864::y1 as z1,
H864::y2 as z2,
H864::y3 as z3,
H864::y4 as z4,
H864::y5 as z5,
H864::y6 as z6,
H864::y7 as z7,
H864::y8 as z8,
H864::y9 as z9,
hs_2_d::hs_2 as z10,
hs_2_d::hs_2_desc as z11;
hs_dim = FOREACH H8642 GENERATE z10, z11, z8, z9, z6, z7, z4, z5;
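If you want to check the job count yourself, one option (a sketch, not the only way) is Pig's EXPLAIN operator, which prints the logical, physical and MapReduce plans for an alias. Running it against both versions of the script lets you compare how many MapReduce jobs each one compiles to.

```pig
-- Assumes hs_dim has been defined earlier in the same script or Grunt session.
-- Prints the logical, physical and MapReduce plans for the final relation.
EXPLAIN hs_dim;
```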
If you have a bag of tuples, DataFu's AliasBagFields may help.
Answer 1 (score: 0)
Pig will always prefix fields with bagname:: to disambiguate them after a join. Unfortunately, I don't think you can avoid that.
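One thing that may reduce the typing: as far as I know, Pig only requires the :: prefix when a short field name is ambiguous in the joined schema. The sketch below reuses the relation names from the question (trimmed to the first join) and assumes that behaviour; the alias h86_flat is made up for illustration. Use DESCRIBE to see what the resulting fields are actually called on your Pig version.

```pig
-- Inputs mirror the question's relations, trimmed to what the first join needs.
hs_8   = LOAD 'hs_8_distinct' USING PigStorage('^') AS (hs_8:chararray, hs_8_desc:chararray);
hs_8_d = FOREACH hs_8 GENERATE SUBSTRING(hs_8,0,6) AS hs_6, hs_8, hs_8_desc;
hs_6_d = LOAD 'hs_6_distinct' USING PigStorage('^') AS (hs_6:chararray, hs_6_desc:chararray);

H86 = JOIN hs_8_d BY hs_6, hs_6_d BY hs_6 USING 'replicated';

-- hs_8, hs_8_desc and hs_6_desc each occur only once in H86,
-- so (assumption) their short names resolve without a prefix.
-- hs_6 occurs in both inputs, so it still needs one.
h86_flat = FOREACH H86 GENERATE hs_6_d::hs_6 AS hs_6, hs_6_desc, hs_8, hs_8_desc;

-- Inspect the resulting schema.
DESCRIBE h86_flat;
```

The prefixes still exist in the joined schema, so this does not contradict the answer above; it only means you may not have to spell them out for fields whose names are unique.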