Pig: extracting records whose column value is not distinct

Time: 2014-10-10 00:55:58

Tags: apache-pig

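I want to extract the records whose value in a given column is not distinct, that is, the value appears in more than one record. How can I do this?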

Example input:

(user1, value1, value2)
(user1, value3, value4)
(user2, value5, value6)
(user3, value7, value8)
(user4, value9, value10)
(user4, value11, value12)

After extracting the records that have a duplicated value in column 1, the output would be:

(user1, value1, value2)
(user1, value3, value4)
(user4, value9, value10)
(user4, value11, value12)

Thanks in advance!

1 answer:

Answer 0: (score: 0)

Let me know if this works for you. For testing purposes I declared value1 and value2 as chararray; in your actual code, change them to int or long as needed.

input.txt
user1,value1,value2
user1,value3,value4
user2,value5,value6
user3,value7,value8
user4,value9,value10
user4,value11,value12

PigScript
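-- Load the comma-separated records with an explicit schema
A = LOAD 'input.txt' USING PigStorage(',') AS (user:chararray, value1:chararray, value2:chararray);
-- Group rows by user so each group holds every row for that user
B = GROUP A BY user;
-- Flatten each group back into rows, tagging every row with its group's size
C = FOREACH B GENERATE FLATTEN(A), COUNT(A) AS cnt;
-- Keep only rows whose user appears more than once
D = FILTER C BY cnt > 1;
-- Drop the helper count column
E = FOREACH D GENERATE A::user, A::value1, A::value2;
DUMP E;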

Output:
(user1,value1,value2)
(user1,value3,value4)
(user4,value9,value10)
(user4,value11,value12)
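
A possible variation, offered only as a sketch: filter the groups by size before flattening, so the helper cnt column and the final projection are not needed. It assumes the same input.txt and schema as the script above; the relation names are illustrative.

```pig
-- Same input and schema as the script above (assumed)
A = LOAD 'input.txt' USING PigStorage(',') AS (user:chararray, value1:chararray, value2:chararray);
-- One group per user; each group carries a bag named A with that user's rows
B = GROUP A BY user;
-- Keep only the groups that contain more than one row
C = FILTER B BY COUNT(A) > 1;
-- Expand the surviving groups back into individual records
D = FOREACH C GENERATE FLATTEN(A);
DUMP D;
```

This should yield the same four records for the sample input, although DUMP does not guarantee row order. Note that COUNT skips tuples whose first field is null; if user can be null and those rows matter, COUNT_STAR may be the better choice.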