Pig Latin: splitting columns into rows

Asked: 2014-05-27 08:34:22

Tags: apache-pig

Is there any way in Pig Latin to turn comma-separated column values into rows, so that I get the following?

Input:

id|column1|column2
1|a,b,c|1,2,3
2|d,e,f|4,5,6

Required output:

id|column1|column2
1|a|1
1|b|2
1|c|3
2|d|4
2|e|5
2|f|6

Thanks

1 answer:

Answer 0 (score: 2)

I'd be willing to bet this isn't the best way to do it, but...

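-- Load the pipe-delimited input; both list columns arrive as plain strings
data = load 'input' using PigStorage('|') as (id:chararray, col1:chararray,
       col2:chararray);
-- One row per token (the comma is among TOKENIZE's default delimiters)
A = foreach data generate id, flatten(TOKENIZE(col1));
B = foreach data generate id, flatten(TOKENIZE(col2));
-- Number the rows of each relation so they can be paired up
RA = RANK A;
RB = RANK B;
-- Round-trip through storage; assumes each result lands in a single part file
store RA into 'ra_temp' using PigStorage(',');
store RB into 'rb_temp' using PigStorage(',');
data_a = load 'ra_temp/part-m-00000' using PigStorage(',');
data_b = load 'rb_temp/part-m-00000' using PigStorage(',');
-- Join on the rank ($0), then keep id, the col1 token and the col2 token
jed = JOIN data_a BY $0, data_b BY $0;
final = foreach jed generate $1, $2, $5;
dump final;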

(1,a,1)
(1,b,2)
(1,c,3)
(2,d,4)
(2,e,5)
(2,f,6)

store final into '~/some_dir' using PigStorage('|');
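
To see why joining on the rank lines the two columns up, the same steps can be sketched in plain Python. This only illustrates the logic and is not Pig code; the function name `rank_and_join` and the in-memory list of rows are assumptions made for the example.

```python
def rank_and_join(rows):
    """Mimic the Pig script: flatten each column, rank the rows, join on rank."""
    # flatten(TOKENIZE(col)) yields one (id, token) row per comma-separated token
    a = [(rid, tok) for rid, col1, _ in rows for tok in col1.split(',')]
    b = [(rid, tok) for rid, _, col2 in rows for tok in col2.split(',')]
    # RANK prepends a 1-based row number to every row
    ra = [(n, rid, tok) for n, (rid, tok) in enumerate(a, start=1)]
    rb = {n: (rid, tok) for n, (rid, tok) in enumerate(b, start=1)}
    # Inner join on the rank, then keep id, col1 token, col2 token ($1, $2, $5)
    return [(rid, tok, rb[n][1]) for n, rid, tok in ra if n in rb]


rows = [('1', 'a,b,c', '1,2,3'), ('2', 'd,e,f', '4,5,6')]
for row in rank_and_join(rows):
    print(row)
```

In this sketch the pairing is purely positional, so it only gives the intended result when both columns hold the same number of tokens in every row and both relations are numbered in the same row order.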

Edit: I really liked this question and was discussing it with a colleague, who came up with a simpler, more elegant solution. If you have Jython installed...

#  create file called udf.py

@outputSchema("innerBag:bag{innerTuple:(column1:chararray, column2:chararray)}")
def pigzip(column1, column2):
    c1 = column1.split(',')
    c2 = column2.split(',')
    innerBag = zip(c1, c2)
    return innerBag

Then in Pig:

$ pig -x local
register udf.py using jython as udf;
data = load 'input' using PigStorage('|') as (id:chararray, column1:chararray,
       column2:chararray);
result = foreach data generate id, flatten(udf.pigzip(column1, column2));
dump result;
store result into 'output' using PigStorage('|');
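
The UDF's logic can also be checked outside Pig. The following is a minimal sketch in plain Python, with the `@outputSchema` decorator left out because Pig supplies it when the script is registered; the loop that emulates `flatten` and the hard-coded sample rows are assumptions made for the example.

```python
def pigzip(column1, column2):
    # Same body as the UDF; list() makes the result concrete on Python 3,
    # where zip returns an iterator instead of a list.
    c1 = column1.split(',')
    c2 = column2.split(',')
    return list(zip(c1, c2))


# Emulate flatten(): emit one output row per tuple in the returned bag
sample = [('1', 'a,b,c', '1,2,3'), ('2', 'd,e,f', '4,5,6')]
for rid, col1, col2 in sample:
    for left, right in pigzip(col1, col2):
        print('|'.join((rid, left, right)))
```

Note that `zip` stops at the shorter list, so a row whose two columns have different token counts would silently lose the extra values.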