Using Pig

Date: 2016-09-16 13:25:15

Tags: hadoop apache-pig

I want to use Pig to union/merge two files. However, this union differs from the usual one. Here are my files (h* are the file headers):

F1 : 
h1,h2,h3,h4
a01,a02,a03,a04
a11,a12,a13,a14

F2 : 
h3,h4,h5,h6
a23,a24,b01,b02
a33,a34,b11,b12

The output must be the union of these files, as shown below:

FR :
h1,h2,h3,h4,h5,h6 
a01,a02,a03,a04,,
a11,a12,a13,a14,,
,,a23,a24,b01,b02
,,a33,a34,b11,b12

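Another difficulty is that I want to make it generic, so that it works for a dynamic number of common columns. Right now there are two common columns, but there could be 3 or 1, or even none at all. For example: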

F1 :
h1,h2,h3,h4
a1,a2,a3,a4

F2
h5,h6,h7,h8
b1,b2,b3,b4

FR
a1,a2,a3,a4
,,,,b1,b2,b3,b4

Any hints/help would be appreciated.

2 Answers:

Answer 0: (score: 0)

Here is how to do it statically:

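-- Pad each relation with typed nulls so both expose h1..h6 in the same order (chararray is assumed here).
F1full = FOREACH F1 GENERATE h1, h2, h3, h4, (chararray)NULL AS h5, (chararray)NULL AS h6;
F2full = FOREACH F2 GENERATE (chararray)NULL AS h1, (chararray)NULL AS h2, h3, h4, h5, h6;

-- Pig's UNION takes a comma-separated list of relations.
FR = UNION F1full, F2full;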

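Pig is not very flexible, so I don't think this can be generated dynamically for the generic case.

If you want a solution for the generic case, you can use a language such as Python to build the required commands from the metadata of the stored tables/files.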
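
As a rough illustration of that idea, the sketch below builds the padding and union statements from two header lists. The function name `build_union_script`, the `full` alias suffix, and the assumption that every column is a chararray are all hypothetical choices made for this example; the headers would normally be read from the first line of each file.

```python
def build_union_script(left_alias, left_cols, right_alias, right_cols, out_alias="FR"):
    """Return Pig Latin text that pads both relations to one schema and unions them."""
    # Merged header: every left column first, then the right-only columns, in order.
    merged = list(left_cols) + [c for c in right_cols if c not in left_cols]

    def projection(own_cols):
        # Keep the columns the relation has; emit a typed null for the others.
        # Assumption: all columns are chararray.
        return ", ".join(
            c if c in own_cols else f"(chararray)NULL AS {c}" for c in merged
        )

    lines = [
        f"{left_alias}full = FOREACH {left_alias} GENERATE {projection(left_cols)};",
        f"{right_alias}full = FOREACH {right_alias} GENERATE {projection(right_cols)};",
        f"{out_alias} = UNION {left_alias}full, {right_alias}full;",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_union_script("F1", ["h1", "h2", "h3", "h4"],
                             "F2", ["h3", "h4", "h5", "h6"]))
```

Because the merged header is computed from the inputs, the same function covers three, one, or zero common columns without changes.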

Answer 1: (score: 0)

I tried to solve the problem with the following approach:

1) Load both files.
2) Add a counter to each file to generate a unique field (ID).
3) Start the counter for file B where the counter for file A ended.
4) Cogroup both files on the common columns, including the counter.
5) Project all of the group columns into a separate relation.
6) Generate the uncommon columns from both files, along with the counter.
7) Join the group columns with the uncommon columns from file A on the counter.
8) Join the result of step 7 with the uncommon columns from file B on the counter.
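
To check what these steps should produce, they can be mimicked in plain Python on the sample data from the question. This is only an in-memory sketch, not Pig code; `merge_rows` is a hypothetical name, and the output column order (A-only columns, then common columns, then B-only columns) is an assumption that happens to match the question's example.

```python
def merge_rows(header_a, rows_a, header_b, rows_b):
    """Mimic the counter + cogroup + two left outer joins on in-memory rows."""
    common = [c for c in header_a if c in header_b]
    only_a = [c for c in header_a if c not in common]
    only_b = [c for c in header_b if c not in common]

    # Steps 2-3: number the rows of A from 1, then continue the numbering for B.
    ranked_a = {i: dict(zip(header_a, r)) for i, r in enumerate(rows_a, start=1)}
    ranked_b = {i: dict(zip(header_b, r))
                for i, r in enumerate(rows_b, start=len(rows_a) + 1)}

    # Steps 4-5: the counter is unique, so each group holds exactly one row;
    # keep only the common column values per counter.
    groups = {}
    for rank, row in list(ranked_a.items()) + list(ranked_b.items()):
        groups[rank] = {c: row[c] for c in common}

    # Steps 6-8: left outer join the uncommon columns of A, then of B, on the counter.
    merged_header = only_a + common + only_b
    merged_rows = []
    for rank in sorted(groups):
        a_part = ranked_a.get(rank, {})
        b_part = ranked_b.get(rank, {})
        record = {c: a_part.get(c, "") for c in only_a}
        record.update(groups[rank])
        record.update({c: b_part.get(c, "") for c in only_b})
        merged_rows.append([record[c] for c in merged_header])
    return merged_header, merged_rows


if __name__ == "__main__":
    header, rows = merge_rows(
        ["h1", "h2", "h3", "h4"], [["a01", "a02", "a03", "a04"], ["a11", "a12", "a13", "a14"]],
        ["h3", "h4", "h5", "h6"], [["a23", "a24", "b01", "b02"], ["a33", "a34", "b11", "b12"]],
    )
    print(",".join(header))
    for row in rows:
        print(",".join(row))
```

Missing values are written as empty strings here so the printed rows look like the CSV in the question; in Pig they would be nulls.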

Below is the Pig script that does this. Since the script is generic, I have listed all the parameters it requires before it can be run.

-- Parameters required: $file1_path, $file2_path, $store_path, $file1_schema, $file2_schema, $COUNT_A (number of rows in file A), $CMN_COLUMN_A (common columns in A), $CMN_COLUMN_B (common columns in B), $UNCMN_COLUMN_A (unique columns in file A), $UNCMN_COLUMN_B (unique columns in file B).
A = LOAD '$file1_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~',  'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as ($file1_schema);
B = LOAD '$file2_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~',  'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as ($file2_schema);

RANK_A = RANK A;
RANK_B = RANK B;
COUNT_RANK_B = FOREACH RANK_B GENERATE ($0+(long)'$COUNT_A') as rank_B, $1 ..;

COGRP_RANK_AB = COGROUP RANK_A BY($CMN_COLUMN_A), COUNT_RANK_B BY ($CMN_COLUMN_B);

CMN_COGRP_RANK_AB = FOREACH COGRP_RANK_AB GENERATE FLATTEN(group) AS ($CMN_COLUMN_A);
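-- Step 6: uncommon columns of each file, along with the counter.
-- UNCMN_RA is assumed to mirror UNCMN_RB; the first join below requires it.
UNCMN_RA = FOREACH RANK_A GENERATE $UNCMN_COLUMN_A;
UNCMN_RB = FOREACH COUNT_RANK_B GENERATE $UNCMN_COLUMN_B;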

JOIN_CMN_UNCMN_A = JOIN CMN_COGRP_RANK_AB BY(rank_A) LEFT OUTER, UNCMN_RA by rank_A;
JOIN_CMN_UNCMN_B = JOIN JOIN_CMN_UNCMN_A BY(CMN_COGRP_RANK_AB::rank_A) LEFT OUTER, UNCMN_RB by rank_B;

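-- Assumption: FINAL_DATA is the full result of the second join; narrow the projection as needed.
FINAL_DATA = FOREACH JOIN_CMN_UNCMN_B GENERATE *;

STORE FINAL_DATA INTO '$store_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');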
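
The parameters named at the top of the script could be derived from the two header lines instead of being written by hand. The sketch below is one way to do that, under several assumptions that are not confirmed by the answer: every column is a chararray, the counter fields `rank_A` and `rank_B` are prepended to both the common and the uncommon column lists (following steps 4 and 6), and the script is saved under the hypothetical name `merge.pig`. The helper names `pig_params` and `pig_command` are also made up for this example.

```python
def pig_params(header_a, header_b, count_a):
    """Derive the column-related script parameters from the two headers."""
    common = [c for c in header_a if c in header_b]
    only_a = [c for c in header_a if c not in common]
    only_b = [c for c in header_b if c not in common]
    return {
        # Assumption: every column is loaded as chararray.
        "file1_schema": ", ".join(f"{c}:chararray" for c in header_a),
        "file2_schema": ", ".join(f"{c}:chararray" for c in header_b),
        "COUNT_A": str(count_a),
        # Assumption: the counter is part of both the group key and the projections.
        "CMN_COLUMN_A": ", ".join(["rank_A"] + common),
        "CMN_COLUMN_B": ", ".join(["rank_B"] + common),
        "UNCMN_COLUMN_A": ", ".join(["rank_A"] + only_a),
        "UNCMN_COLUMN_B": ", ".join(["rank_B"] + only_b),
    }


def pig_command(script_path, params):
    """Build an argument list using Pig's -param and -f command-line options."""
    cmd = ["pig"]
    for name, value in params.items():
        cmd += ["-param", f"{name}={value}"]
    cmd += ["-f", script_path]
    return cmd


if __name__ == "__main__":
    params = pig_params(["h1", "h2", "h3", "h4"], ["h3", "h4", "h5", "h6"], count_a=2)
    # The file paths and store path would be added to params by the caller.
    print(pig_command("merge.pig", params))
```

Returning an argument list rather than a single string keeps values that contain spaces or commas intact when the command is passed to something like `subprocess.run`.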