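I want to extract the records whose value in a given column is not distinct, i.e. the value occurs in more than one record. How can I do this?
For example, given this input: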
(user1, value1, value2)
(user1, value3, value4)
(user2, value5, value6)
(user3, value7, value8)
(user4, value9, value10)
(user4, value11, value12)
After extracting the records that have a duplicated value in column 1, the output would be:
(user1, value1, value2)
(user1, value3, value4)
(user4, value9, value10)
(user4, value11, value12)
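Thanks a lot in advance!
Answer 0 (score: 0)
Let me know if this works for you. For testing purposes I declared value1 and value2 as chararray; in your actual code, change value1 and value2 to int or long.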
input.txt
user1,value1,value2
user1,value3,value4
user2,value5,value6
user3,value7,value8
user4,value9,value10
user4,value11,value12
PigScript
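-- Load the comma-separated file (value1/value2 are chararray for testing only).
A = LOAD 'input.txt' USING PigStorage(',') AS (user:chararray,value1:chararray,value2:chararray);
-- Group the records by user, so each group holds a bag of that user's records.
B = GROUP A BY user;
-- Flatten each bag back into records and attach the size of its group.
C = FOREACH B GENERATE FLATTEN(A),COUNT(A) AS cnt;
-- Keep only the records whose user occurs more than once.
D = FILTER C BY cnt >1;
-- Drop the helper count and keep the original three columns.
E = FOREACH D GENERATE A::user,A::value1,A::value2;
DUMP E;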
Output:
(user1,value1,value2)
(user1,value3,value4)
(user4,value9,value10)
(user4,value11,value12)
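A slightly shorter variant of the same idea, as a sketch only: filter the grouped relation on the bag size first, then flatten the groups that survive. This avoids carrying the cnt column through and projecting it away afterwards. It assumes the same input.txt and schema as above and has not been run against any particular Pig version.

```pig
-- Same load as above; schema is the test schema from the answer.
A = LOAD 'input.txt' USING PigStorage(',') AS (user:chararray,value1:chararray,value2:chararray);
-- One group per user; the bag A holds all records of that user.
B = GROUP A BY user;
-- Keep only the groups that contain more than one record.
C = FILTER B BY COUNT(A) > 1;
-- Expand the remaining bags back into individual records.
D = FOREACH C GENERATE FLATTEN(A);
DUMP D;
```

This should yield the same four records for user1 and user4. The flattened fields are still named A::user, A::value1 and A::value2, and the order in which DUMP prints the records is not guaranteed.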