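Creating files of matched and unmatched records with a Pig script

Time: 2015-02-01 00:39:43

Tags: hadoop apache-pig

Could you suggest Pig logic for the following file-matching and duplicate-removal tasks?

1) Remove duplicate entries based on the key RoleId -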

InputFile1
--------------
RoleId   Name 
1        A 
2        B 
3        C
2        D 
5        E
5        F
7        G

OutputFile1 (unique records only)

RoleId   Name 
1        A 
3        C
7        G

OutputFile2 (captures duplicate records)

RoleId   Name 
2        B
2        D
5        E
5        F

2) File matching, with RoleId as the key -

InputFile1  InputFile2 
----------- ---------- 
RoleId Name RoleId Age 
1      A    1      20 
2      B    2      21 
3      C    1      22 
4      D    2      23 
5      E    3      24 
            7      25

OutputFile1 (matched records)      OutputFile2 (unmatched records from InputFile1)
-----------------------------      -----------------------------------------------
RoleId Name Age                    RoleId Name
1      A    20, 22                 4      D
2      B    21, 23                 5      E
3      C    24

Thanks,

1 answer:

Answer 0 (score: 1)

Can you try the approach below?

Problem 1 solution:
Input

1       A
2       B
3       C
2       D
5       E
5       F
7       G

PigScript:

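-- Load the input; PigStorage() splits fields on tab by default.
A = LOAD 'in.txt' USING PigStorage() AS (RoleId:int, Name:chararray);
B = GROUP A BY RoleId;
-- Emit every original row together with the size of its RoleId group.
C = FOREACH B GENERATE FLATTEN($1) AS (RoleId, Name), COUNT(A) AS cnt;

-- cnt == 1: the RoleId is unique; cnt >= 2: the RoleId is duplicated.
SPLIT C INTO Distval IF (cnt==1), NonDistVal IF (cnt>=2);

D = FOREACH Distval GENERATE RoleId, Name;
STORE D INTO 'DistFile' USING PigStorage();

E = FOREACH NonDistVal GENERATE RoleId, Name;
STORE E INTO 'NonDistFile' USING PigStorage();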

Output:
cat DistFile/part-r-00000

1       A
3       C
7       G

cat NonDistFile/part-r-00000

2       B
2       D
5       E
5       F  
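
As an alternative sketch (not part of the original answer), the groups can be classified before they are flattened, so the count is tested once per RoleId instead of being carried on every row. The relation names (recs, grouped, uniq, dups) and the output directories 'UniqFile' and 'DupFile' are illustrative assumptions; the input is assumed to be the same tab-delimited in.txt.

```pig
-- Sketch under the assumptions stated above; names are hypothetical.
recs    = LOAD 'in.txt' USING PigStorage() AS (RoleId:int, Name:chararray);
grouped = GROUP recs BY RoleId;

-- Decide per group: exactly one row means unique, two or more means duplicated.
SPLIT grouped INTO uniq IF (COUNT(recs) == 1), dups IF (COUNT(recs) >= 2);

-- Flatten each side back into (RoleId, Name) rows.
uniq_rows = FOREACH uniq GENERATE FLATTEN(recs);
dup_rows  = FOREACH dups GENERATE FLATTEN(recs);

STORE uniq_rows INTO 'UniqFile' USING PigStorage();
STORE dup_rows  INTO 'DupFile'  USING PigStorage();
```

This should produce the same two sets of rows as DistFile and NonDistFile, though the row order within each output is not guaranteed.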

Problem 2 solution:
InputFile1

1       A
2       B
3       C
4       D
5       E

InputFile2

1       20
2       21
1       22
2       23
3       24
7       25

PigScript:

A = LOAD 'InputFile1' USING PigStorage() AS (RoleId:long, Name:chararray);
B = LOAD 'InputFile2' USING PigStorage() AS (RoleId:long, Age:int);

-- One row per RoleId, holding a bag of A rows and a bag of B rows.
C = COGROUP A BY RoleId, B BY RoleId;
-- Drop keys that appear only in InputFile2 (here, RoleId 7).
D = FILTER C BY NOT IsEmpty(A);

SPLIT D INTO RoleMatch IF NOT IsEmpty(B), NoRoleMatch IF IsEmpty(B);

-- Matched: the A row plus all of its ages packed into one tuple.
E = FOREACH RoleMatch GENERATE FLATTEN($1), BagToTuple(B.Age);
STORE E INTO 'RoleMatchFile' USING PigStorage();

-- Unmatched: the A row only.
F = FOREACH NoRoleMatch GENERATE FLATTEN($1);
STORE F INTO 'NoRoleMatchFile' USING PigStorage();

Output:
cat RoleMatchFile/part-r-00000

1       A       (20,22)
2       B       (21,23)
3       C       (24)   

cat NoRoleMatchFile/part-r-00000

4       D
5       E
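
The matched output above stores the ages as a tuple such as (20,22), while the question asked for a plain list such as 20, 22. A possible variation, sketched here and not part of the original answer, is to join the ages with the built-in BagToString. The relation names and the output directory 'RoleMatchFlatFile' are hypothetical, and the input files are assumed to be the same tab-delimited InputFile1 and InputFile2.

```pig
-- Sketch under the assumptions stated above; names are hypothetical.
roles = LOAD 'InputFile1' USING PigStorage() AS (RoleId:long, Name:chararray);
ages  = LOAD 'InputFile2' USING PigStorage() AS (RoleId:long, Age:int);

grp     = COGROUP roles BY RoleId, ages BY RoleId;
-- Keep only keys present in both files.
matched = FILTER grp BY (NOT IsEmpty(roles)) AND (NOT IsEmpty(ages));

-- BagToString concatenates the bag's values using the given delimiter
-- (its default delimiter is '_', so ', ' is passed explicitly).
with_ages = FOREACH matched GENERATE FLATTEN(roles),
                                     BagToString(ages.Age, ', ') AS Ages;

STORE with_ages INTO 'RoleMatchFlatFile' USING PigStorage();
```

Pig does not guarantee the order of elements inside a bag, so the ages for a given RoleId may appear in either order unless they are sorted in a nested FOREACH first.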