I'm wondering whether it's possible to pivot a table in one pass in Apache Pig.
Input:
Id Column1 Column2 Column3
1 Row11 Row12 Row13
2 Row21 Row22 Row23
Output:
Id Name Value
1 Column1 Row11
1 Column2 Row12
1 Column3 Row13
2 Column1 Row21
2 Column2 Row22
2 Column3 Row23
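The real data has dozens of columns.
I can do this in one pass with awk and run it through Hadoop Streaming. But most of my code is in Apache Pig, so I'd like to know whether it can be done efficiently in Pig.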
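Answer 0: (score: 7)
You can do it in two ways: 1. Write a UDF that returns a bag of tuples. This is the most flexible solution, but it requires Java code; 2. Write a rigid script like this: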
inpt = load '/pig_fun/input/pivot.txt' as (Id, Column1, Column2, Column3);
bagged = foreach inpt generate Id, TOBAG(TOTUPLE('Column1', Column1), TOTUPLE('Column2', Column2), TOTUPLE('Column3', Column3)) as toPivot;
pivoted_1 = foreach bagged generate Id, FLATTEN(toPivot) as t_value;
pivoted = foreach pivoted_1 generate Id, FLATTEN(t_value);
dump pivoted;
Running this script gives me the following result:
(1,Column1,11)
(1,Column2,12)
(1,Column3,13)
(2,Column1,21)
(2,Column2,22)
(2,Column3,23)
(3,Column1,31)
(3,Column2,32)
(3,Column3,33)
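For option 1, here is a minimal sketch of a UDF that returns a bag of tuples. It is written as a Jython (Python) UDF rather than the Java UDF mentioned above, which is a substitution on my part. The file name `unpivot_udf.py`, the function name `unpivot`, and the convention of passing the column names as one comma-separated string are all made up for illustration. It also assumes the columns are loaded as `chararray`, so the values reach Python as strings.

```python
# unpivot_udf.py (hypothetical file name)

try:
    outputSchema  # Pig provides this decorator when the script is registered "using jython"
except NameError:
    # Fallback so the module can also be imported and tested outside Pig.
    def outputSchema(schema):
        def decorate(func):
            return func
        return decorate


@outputSchema("pivoted:bag{t:tuple(name:chararray, value:chararray)}")
def unpivot(names, row):
    """Return one (name, value) tuple per column.

    names: comma-separated column names, e.g. 'Column1,Column2,Column3'
    row:   tuple holding the column values in the same order
    """
    keys = names.split(",")
    if len(keys) != len(row):
        raise ValueError("expected %d values, got %d" % (len(keys), len(row)))
    return [(key, value) for key, value in zip(keys, row)]


# Possible use from Pig Latin (untested sketch, assumes chararray columns):
#   register 'unpivot_udf.py' using jython as udfs;
#   pivoted = foreach inpt generate Id,
#       FLATTEN(udfs.unpivot('Column1,Column2,Column3',
#                            TOTUPLE(Column1, Column2, Column3)));
```

With dozens of columns, the list of names and the `TOTUPLE(...)` call still have to be written out once, but the pairing logic lives in one place.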
Answer 1: (score: 3)
I removed Column3 from Id 1 to show how to handle optional (NULL) data:
Id Name Value
1 Column1 Row11
1 Column2 Row12
2 Column1 Row21
2 Column2 Row22
2 Column3 Row23
-- pigscript.pig
data1 = load 'data.txt' using PigStorage() as (id:int, key:chararray, value:chararray);
grped = group data1 by id;
pvt = foreach grped {
    col1 = filter data1 by key == 'Column1';
    col2 = filter data1 by key == 'Column2';
    col3 = filter data1 by key == 'Column3';
    generate flatten(group) as id,
        flatten(col1.value) as col1,
        flatten(col2.value) as col2,
        flatten((IsEmpty(col3.value) ? {('NULL')} : col3.value)) as col3; --HANDLE NULL
};
dump pvt;
Result:
(1,Row11,Row12,NULL)
(2,Row21,Row22,Row23)
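Note that this script goes from Id/Name/Value rows back to one row per Id. The `IsEmpty` guard matters because flattening an empty bag produces no output row at all, so without it Id 1 would disappear from the result. Only col3 is guarded here; col1 and col2 would need the same treatment if they can be missing. As a rough model of the logic, here is a plain Python sketch (not Pig; the function name `pivot_long_to_wide` is made up). Unlike the Pig script, it keeps only the last value when a key repeats for the same id, whereas `flatten` would emit every combination.

```python
from collections import defaultdict


def pivot_long_to_wide(rows, columns, missing="NULL"):
    """Group (id, key, value) rows by id and emit one wide tuple per id."""
    by_id = defaultdict(dict)
    for row_id, key, value in rows:
        by_id[row_id][key] = value
    # Fill absent columns with the placeholder, like the IsEmpty branch does.
    return [
        (row_id,) + tuple(values.get(col, missing) for col in columns)
        for row_id, values in sorted(by_id.items())
    ]


rows = [
    (1, "Column1", "Row11"),
    (1, "Column2", "Row12"),
    (2, "Column1", "Row21"),
    (2, "Column2", "Row22"),
    (2, "Column3", "Row23"),
]
wide = pivot_long_to_wide(rows, ["Column1", "Column2", "Column3"])
print(wide)
# [(1, 'Row11', 'Row12', 'NULL'), (2, 'Row21', 'Row22', 'Row23')]
```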