Distinct values for a composite key using Pig

Time: 2017-05-26 15:51:21

Tags: hadoop apache-pig

Input dataset

data.csv
----------
col1,col2,col3
68,emp101,a1
74,emp101,null
56,emp101,a1
67,emp101,a2
45,emp102,b1
78,emp102,b2
23,emp102,b3

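For each col2, I need to find the distinct values of col3, excluding null.

emp101 has 2 distinct values -----> a1, a2
emp102 has 3 distinct values -----> b1, b2, b3

emp101 has 4 records and 2 distinct values, so its 4 records must be replicated 2 times with a new column col4 added, which holds one of the distinct col3 values for each copy.

emp102 has 3 records and 3 distinct values, so its 3 records must be replicated 3 times with a new column col4 added, which holds one of the distinct col3 values for each copy.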

Expected Output
col1,col2,col3,col4
68,emp101,a1,a1
74,emp101,null,a1
56,emp101,a1,a1
67,emp101,a2,a1
68,emp101,a1,a2
74,emp101,null,a2
56,emp101,a1,a2
67,emp101,a2,a2
45,emp102,b1,b1
78,emp102,b2,b1
23,emp102,b3,b1
45,emp102,b1,b2
78,emp102,b2,b2
23,emp102,b3,b2
45,emp102,b1,b3
78,emp102,b2,b3
23,emp102,b3,b3


grunt>input1= load 'data.csv' using PigStorage(',') as (age: int, eid: chararray, grade: chararray);
grunt>input2= GROUP input1 by eid;
grunt> input3= distinct input1 by eid,grade;
2017-05-26 08:35:59,056 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 31, column 24>  mismatched input 'by' expecting SEMI_COLON

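1 answer:

Answer 0 (score: 0)

Let's call the original relation base. Note that DISTINCT takes no BY clause, which is why the attempt in the question fails to parse; project the columns you need first, then apply DISTINCT to the result.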
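
For reference, here is a minimal sketch of how base might be loaded so that the field names match the snippets in this answer. The column names come from the header row shown in the question; the types, the intermediate alias raw, and the header handling are assumptions about how data.csv is actually stored.

```pig
-- Assumed schema: field names follow the header row shown in the question.
raw = LOAD 'data.csv' USING PigStorage(',')
      AS (col1:int, col2:chararray, col3:chararray);

-- Assumption: if the file really contains the header line, the text 'col1'
-- cannot be converted to int and is loaded as null, so this drops that row.
base = FILTER raw BY col1 IS NOT NULL;
```

If the file stores the literal text null rather than an empty field, col3 loads as the string 'null', and a filter on col3 is not null alone would keep it; in that case the filter would probably also need AND col3 != 'null'.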

-- Create a relation with col2 and col3
-- Filter out where col3 is null
-- Take distinct tuples (equivalent to records)
col23_rel = DISTINCT (FOREACH (FILTER base BY col3 is not null) generate col2, col3);
dump col23_rel;
(emp101,a1)
(emp101,a2)
(emp102,b1)
(emp102,b2)
(emp102,b3)

-- Now, join col23_rel back to base on col2. This produces the desired rows;
-- the join key from col23_rel appears as an extra column in the output.
jnd = JOIN base BY col2, col23_rel BY col2;

dump jnd;
(68,emp101,a1,emp101,a1)
(68,emp101,a1,emp101,a2)
(74,emp101,,emp101,a1)
(74,emp101,,emp101,a2)
(56,emp101,a1,emp101,a1)
(56,emp101,a1,emp101,a2)
(67,emp101,a2,emp101,a1)
(67,emp101,a2,emp101,a2)
(45,emp102,b1,emp102,b1)
(45,emp102,b1,emp102,b2)
(45,emp102,b1,emp102,b3)
(78,emp102,b2,emp102,b1)
(78,emp102,b2,emp102,b2)
(78,emp102,b2,emp102,b3)
(23,emp102,b3,emp102,b1)
(23,emp102,b3,emp102,b2)
(23,emp102,b3,emp102,b3)
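
The joined relation carries five fields, while the expected output in the question has four. A possible final step, sketched here on the assumption that jnd has the schema produced by the join above, is to project the fields and rename the joined distinct value to col4. The aliases projected and sorted are hypothetical names chosen for illustration.

```pig
-- After a JOIN, fields are disambiguated with the relation name and '::'.
-- Keep the three original columns and expose the distinct value as col4.
projected = FOREACH jnd GENERATE
            base::col1      AS col1,
            base::col2      AS col2,
            base::col3      AS col3,
            col23_rel::col3 AS col4;

-- Optional: order the rows so they are easier to compare with the
-- expected output (a DUMP without ORDER has no guaranteed row order).
sorted = ORDER projected BY col2, col4;
DUMP sorted;
```

The ORDER step is only for readability; it groups each copy of an employee's records under the same col4 value, as in the expected output.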