I want to use Pig to union/merge two files. However, this is not a regular union. Here are my files (the h* values are the file headers):
F1 :
h1,h2,h3,h4
a01,a02,a03,a04
a11,a12,a13,a14
F2 :
h3,h4,h5,h6
a23,a24,b01,b02
a33,a34,b11,b12
The output must be the union of these files, as shown below:
FR :
h1,h2,h3,h4,h5,h6
a01,a02,a03,a04,,
a11,a12,a13,a14,,
,,a23,a24,b01,b02
,,a33,a34,b11,b12
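A further difficulty is that I want this to be generic, so that it works for a dynamic number of common columns. Right now there are two common columns, but there could be three, one, or none at all. For example: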
F1 :
h1,h2,h3,h4
a1,a2,a3,a4
F2 :
h5,h6,h7,h8
b1,b2,b3,b4
FR :
h1,h2,h3,h4,h5,h6,h7,h8
a1,a2,a3,a4,,,,
,,,,b1,b2,b3,b4
Any hints or help would be appreciated.
Answer 0: (score: 0)
Here is how to do it statically:
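-- Pad each relation with typed nulls for the columns it lacks (chararray is assumed here).
F1full = FOREACH F1 GENERATE h1, h2, h3, h4, (chararray)null AS h5, (chararray)null AS h6;
F2full = FOREACH F2 GENERATE (chararray)null AS h1, (chararray)null AS h2, h3, h4, h5, h6;
-- UNION takes the relations as a comma-separated list.
FR = UNION F1full, F2full;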
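Pig is not very flexible, so I don't think these statements can be generated dynamically inside Pig for the general case.
If you want a solution for the general case, you could use a language such as Python to build the required statements from the metadata of the stored tables/files.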
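As a rough illustration of that idea, here is a minimal Python sketch that builds the padded FOREACH statements and the UNION from two header lists. The function name `build_union_script`, the relation aliases, and the choice to cast missing columns to chararray are all assumptions made for this example; it also assumes the headers have already been read from the files and that F1 and F2 were loaded with those column names.

```python
def build_union_script(cols1, cols2, alias1="F1", alias2="F2", out="FR"):
    """Return Pig Latin statements that pad both relations and union them.

    Hypothetical helper: cols1 and cols2 are the header lists of the two files.
    """
    # Full header: every column of the first file, then the columns that
    # only the second file has, in their original order.
    all_cols = list(cols1) + [c for c in cols2 if c not in cols1]

    def projection(own_cols):
        # Keep a column the relation already has; otherwise emit a typed
        # null under that column name (chararray is an assumption).
        return ", ".join(
            c if c in own_cols else f"(chararray)null AS {c}" for c in all_cols
        )

    return "\n".join([
        f"{alias1}full = FOREACH {alias1} GENERATE {projection(cols1)};",
        f"{alias2}full = FOREACH {alias2} GENERATE {projection(cols2)};",
        f"{out} = UNION {alias1}full, {alias2}full;",
    ])


# Headers taken from the first example in the question.
script = build_union_script(["h1", "h2", "h3", "h4"], ["h3", "h4", "h5", "h6"])
print(script)
```

Because the full column list is computed from the headers, the same function should handle three, one, or zero common columns; the generated text could then be written to a .pig file and run as usual.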
Answer 1: (score: 0)
I tried to solve the problem with the following approach:
1) Load both files.
2) Add a counter to each file to generate a unique field (ID).
3) Start the counter for file B where the counter for file A ended.
4) Cogroup both files on the common columns, including the counter.
5) Project all group columns into a separate relation.
6) Generate the uncommon columns from both files, along with the counter.
7) First join the uncommon columns from file A with the group columns on the counter.
8) Join the result of step 7 with the uncommon columns from file B on the counter.
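Below is the Pig script that does this. Since the script is generic, I have listed all the parameters that must be supplied before running it.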
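-- Parameters required: $file1_path, $file2_path, $store_path, $file1_schema, $file2_schema,
-- $COUNT_A (number of rows in file A), $CMN_COLUMN_A (common columns in A), $CMN_COLUMN_B,
-- $UNCMN_COLUMN_A (columns unique to file A), $UNCMN_COLUMN_B.
-- CSVExcelStorage is part of piggybank; REGISTER piggybank.jar first if it is not already on the classpath.
A = LOAD '$file1_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as ($file1_schema);
B = LOAD '$file2_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as ($file2_schema);
-- RANK prepends a sequential counter (rank_A / rank_B) to every row.
RANK_A = RANK A;
RANK_B = RANK B;
-- Shift B's counter so that it starts where A's counter ended.
COUNT_RANK_B = FOREACH RANK_B GENERATE ($0+(long)'$COUNT_A') as rank_B, $1 ..;
-- Cogroup on the common columns (the column lists include the counter) and keep only the group keys.
COGRP_RANK_AB = COGROUP RANK_A BY ($CMN_COLUMN_A), COUNT_RANK_B BY ($CMN_COLUMN_B);
CMN_COGRP_RANK_AB = FOREACH COGRP_RANK_AB GENERATE FLATTEN(group) AS ($CMN_COLUMN_A);
-- Columns unique to each file, together with the counter.
UNCMN_RA = FOREACH RANK_A GENERATE $UNCMN_COLUMN_A;
UNCMN_RB = FOREACH COUNT_RANK_B GENERATE $UNCMN_COLUMN_B;
JOIN_CMN_UNCMN_A = JOIN CMN_COGRP_RANK_AB BY (rank_A) LEFT OUTER, UNCMN_RA BY rank_A;
JOIN_CMN_UNCMN_B = JOIN JOIN_CMN_UNCMN_A BY (CMN_COGRP_RANK_AB::rank_A) LEFT OUTER, UNCMN_RB BY rank_B;
-- Assumption: FINAL_DATA keeps every joined column; replace * with the columns wanted in the output.
FINAL_DATA = FOREACH JOIN_CMN_UNCMN_B GENERATE *;
STORE FINAL_DATA INTO '$store_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');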