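Could you suggest how to implement the following file-matching logic and remove duplicate entries using Pig?
1) Remove duplicate entries based on the key RoleId -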
InputFile1
--------------
RoleId Name
1 A
2 B
3 C
2 D
5 E
5 F
7 G
Expected output 1 (RoleIds that occur once)
RoleId Name
1 A
3 C
7 G
Expected output 2 (RoleIds that occur more than once)
RoleId Name
2 B
2 D
5 E
5 F
2) File matching, where the key is RoleId -
InputFile1        InputFile2
-----------       ----------
RoleId Name       RoleId Age
1      A          1      20
2      B          2      21
3      C          1      22
4      D          2      23
5      E          3      24
                  7      25
OutputFile1 (matched records)    OutputFile2 (unmatched records from InputFile1)
-----------------------------    -----------------------------------------------
RoleId Name Age                  RoleId Name
1      A    20, 22               4      D
2      B    21, 23               5      E
3      C    24
Thanks,
Answer 0 (score: 1)
Can you try the approach below?
Solution to problem 1:
Input:
1 A
2 B
3 C
2 D
5 E
5 F
7 G
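PigScript:
-- Load the input; PigStorage() splits on tab by default.
A = LOAD 'in.txt' USING PigStorage() AS(RoleId:int,Name:chararray);
-- Group by RoleId, then attach each group's size to every row in it.
B = GROUP A BY RoleId;
C = FOREACH B GENERATE FLATTEN($1) AS(RoleId,Name),COUNT(A) AS cnt;
-- Route rows by group size: 1 means unique, 2 or more means duplicated.
SPLIT C INTO Distval IF (cnt==1), NonDistVal IF (cnt>=2);
-- Drop the helper count column before storing each output.
D = FOREACH Distval GENERATE RoleId,Name;
STORE D INTO 'DistFile' USING PigStorage();
E = FOREACH NonDistVal GENERATE RoleId,Name;
STORE E INTO 'NonDistFile' USING PigStorage();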
Output:
cat DistFile/part-r-00000
1 A
3 C
7 G
cat NonDistFile/part-r-00000
2 B
2 D
5 E
5 F
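To sanity-check the expected output without a Hadoop cluster, the group-then-split logic above can be modelled in plain Python. This is only a rough sketch of what the script does, not Pig code; the function name `split_by_key_count` and the in-memory row list are assumptions made for illustration.

```python
from collections import defaultdict


def split_by_key_count(rows):
    """Model of GROUP BY RoleId followed by SPLIT on the group size.

    Returns (unique, duplicated): rows whose RoleId occurs exactly once,
    and rows whose RoleId occurs two or more times.
    """
    groups = defaultdict(list)
    for role_id, name in rows:
        groups[role_id].append((role_id, name))

    unique, duplicated = [], []
    # Sorting the keys only makes the result deterministic for this sketch.
    for role_id in sorted(groups):
        target = unique if len(groups[role_id]) == 1 else duplicated
        target.extend(groups[role_id])
    return unique, duplicated


# Same rows as the sample input above.
rows = [(1, "A"), (2, "B"), (3, "C"), (2, "D"), (5, "E"), (5, "F"), (7, "G")]
unique, duplicated = split_by_key_count(rows)
print(unique)
print(duplicated)
```

The two lists correspond to the DistFile and NonDistFile outputs respectively.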
Solution to problem 2:
InputFile1:
1 A
2 B
3 C
4 D
5 E
InputFile2:
1 20
2 21
1 22
2 23
3 24
7 25
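PigScript:
A = LOAD 'InputFile1' USING PigStorage() AS(RoleId:long, Name:chararray);
B = LOAD 'InputFile2' USING PigStorage() AS(RoleId:long, Age:int);
-- COGROUP yields one row per RoleId: (group, {A tuples}, {B tuples}).
C = COGROUP A BY RoleId ,B BY RoleId;
-- Keep only RoleIds that appear in InputFile1 (this drops RoleId 7).
D = FILTER C BY NOT IsEmpty(A);
-- Split on whether InputFile2 has any rows for the RoleId.
SPLIT D INTO RoleMatch IF NOT IsEmpty(B),NoRoleMatch IF IsEmpty(B);
-- Matched: emit RoleId, Name, and all ages packed into one tuple.
E = FOREACH RoleMatch GENERATE FLATTEN($1),BagToTuple(B.Age);
STORE E INTO 'RoleMatchFile' USING PigStorage();
-- Unmatched: emit RoleId and Name only.
F = FOREACH NoRoleMatch GENERATE FLATTEN($1);
STORE F INTO 'NoRoleMatchFile' USING PigStorage();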
Output:
cat RoleMatchFile/part-r-00000
1 A (20,22)
2 B (21,23)
3 C (24)
cat NoRoleMatchFile/part-r-00000
4 D
5 E
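The COGROUP-and-split logic above can likewise be modelled in plain Python to check the expected result on small inputs. Again, this is a hedged sketch rather than Pig code: `match_files` and the in-memory lists are hypothetical names for illustration. It keeps ages in file order, whereas Pig does not guarantee the order of tuples inside a bag.

```python
from collections import defaultdict


def match_files(file1, file2):
    """Model of COGROUP on RoleId, restricted to RoleIds present in file1.

    Returns (matched, unmatched): file1 rows with all their ages from
    file2 packed into a tuple, and file1 rows with no file2 entry.
    """
    ages = defaultdict(list)
    for role_id, age in file2:
        ages[role_id].append(age)

    matched, unmatched = [], []
    for role_id, name in file1:
        # A membership test does not create missing keys in a defaultdict.
        if role_id in ages:
            matched.append((role_id, name, tuple(ages[role_id])))
        else:
            unmatched.append((role_id, name))
    return matched, unmatched


# Same rows as InputFile1 and InputFile2 above.
file1 = [(1, "A"), (2, "B"), (3, "C"), (4, "D"), (5, "E")]
file2 = [(1, 20), (2, 21), (1, 22), (2, 23), (3, 24), (7, 25)]
matched, unmatched = match_files(file1, file2)
print(matched)
print(unmatched)
```

RoleId 7 appears only in InputFile2, so it lands in neither list, which matches the effect of the `FILTER C BY NOT IsEmpty(A)` step.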