I am using Pig for data preparation and have run into a problem that looks easy but that I can't solve. For example, I have a column of names.
INPUT:
id | name
-------------
1 | Alicia
2 | Ana
3 | Benita
4 | Berta
5 | Bertha
Desired OUTPUT (can this be achieved with a FOR loop?):
id | name
--------------------------
1_XX_1 | Alicia_id_1
2_XX_1 | Ana_id_1
3_XX_1 | Benita_id_1
4_XX_1 | Berta_id_1
5_XX_1 | Bertha_id_1
1_XX_2 | Alicia_id_2
2_XX_2 | Ana_id_2
3_XX_2 | Benita_id_2
4_XX_2 | Berta_id_2
5_XX_2 | Bertha_id_2
1_XX_3 | Alicia_id_3
2_XX_3 | Ana_id_3
3_XX_3 | Benita_id_3
4_XX_3 | Berta_id_3
5_XX_3 | Bertha_id_3
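Answer 0 (score: 2)
You can do this with a UDF, which has the advantage of being reusable for any number of replications. The Pig script below calls such a UDF; the '3' in the DEFINE statement is passed to the UDF's constructor and corresponds to the three copies per row in the output.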
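-- Register the jar that contains the UDF
REGISTER '/path/to/pigexerciseudf.jar';
-- Alias the UDF, passing '3' as its constructor argument
define replicat pigexerciseudf.replicateinput('3');
-- Load the comma-separated (id, name) records
A = LOAD '/home/hduser/exer.dat' using PigStorage(',') as (a:chararray,b:chararray);
-- Call the UDF on each row and flatten its result into separate rows
B = FOREACH A GENERATE FLATTEN(replicat(a,b)) as (line:chararray) ;
dump B;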
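The source of the `replicateinput` UDF is not reproduced in this answer. As a rough illustration only, here is a plain-Java sketch of the kind of replication logic such a UDF might contain. It deliberately has no Pig dependencies so that it runs on its own; the class name `ReplicateSketch` and the method `replicate` are hypothetical, and wrapping the logic in a Pig `EvalFunc` that returns a bag of tuples is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone sketch; not the actual pigexerciseudf.replicateinput source.
class ReplicateSketch {

    // Returns `times` copies of the (id, name) pair, each suffixed with its copy index.
    // The "_XX_" and "_id_" separators are taken from the desired output in the question.
    static List<String[]> replicate(String id, String name, int times) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 1; i <= times; i++) {
            rows.add(new String[] { id + "_XX_" + i, name + "_id_" + i });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Replicate one sample row three times and print it in Pig's tuple notation.
        for (String[] row : replicate("1", "Alicia", 3)) {
            System.out.println("(" + row[0] + "," + row[1] + ")");
        }
    }
}
```

In a real UDF the replication count would presumably arrive as the constructor argument given in the DEFINE statement, and each generated pair would be added to the returned bag as a two-field tuple, so that FLATTEN can turn the bag into separate rows.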
Input file:
1,Alicia
2,Ana
3,Benita
4,Berta
5,Bertha
Output:
(1_XX_1,Alicia_id_1)
(1_XX_2,Alicia_id_2)
(1_XX_3,Alicia_id_3)
(2_XX_1,Ana_id_1)
(2_XX_2,Ana_id_2)
(2_XX_3,Ana_id_3)
(3_XX_1,Benita_id_1)
(3_XX_2,Benita_id_2)
(3_XX_3,Benita_id_3)
(4_XX_1,Berta _id_1)
(4_XX_2,Berta _id_2)
(4_XX_3,Berta _id_3)
(5_XX_1,Bertha_id_1)
(5_XX_2,Bertha_id_2)
(5_XX_3,Bertha_id_3)