Input dataset
data.csv
----------
col1,col2,col3
68,emp101,a1
74,emp101,null
56,emp101,a1
67,emp101,a2
45,emp102,b1
78,emp102,b2
23,emp102,b3
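For each col2 value, I need to find the distinct values of col3, excluding null.
emp101 has 2 distinct values: a1, a2
emp102 has 3 distinct values: b1, b2, b3
emp101 has 4 records and 2 distinct values, so its 4 records must be replicated 2 times, with a new col4 that holds one of the distinct col3 values in each copy.
emp102 has 3 records and 3 distinct values, so its 3 records must be replicated 3 times, with a new col4 that holds one of the distinct col3 values in each copy.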
Expected Output
col1,col2,col3,col4
68,emp101,a1,a1
74,emp101,null,a1
56,emp101,a1,a1
67,emp101,a2,a1
68,emp101,a1,a2
74,emp101,null,a2
56,emp101,a1,a2
67,emp101,a2,a2
45,emp102,b1,b1
78,emp102,b2,b1
23,emp102,b3,b1
45,emp102,b1,b2
78,emp102,b2,b2
23,emp102,b3,b2
45,emp102,b1,b3
78,emp102,b2,b3
23,emp102,b3,b3
grunt>input1= load 'data.csv' using PigStorage(',') as (age: int, eid: chararray, grade: chararray);
grunt>input2= GROUP input1 by eid;
grunt> input3= distinct input1 by eid,grade;
2017-05-26 08:35:59,056 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 31, column 24> mismatched input 'by' expecting SEMI_COLON
Answer 0 (score: 0)
Let's call the original relation base.
-- Create a relation with col2 and col3
-- Filter out where col3 is null
-- Take distinct tuples (equivalent to records)
col23_rel = DISTINCT (FOREACH (FILTER base BY col3 is not null) generate col2, col3);
dump col23_rel;
(emp101,a1)
(emp101,a2)
(emp102,b1)
(emp102,b2)
(emp102,b3)
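Two notes on this step, as a hedged sketch rather than part of the original answer. First, DISTINCT in Pig works on whole tuples and does not take a BY field list, which is why the attempt in the question fails with ERROR 1200; projecting the wanted fields first and then applying DISTINCT avoids it. Second, IS NOT NULL only removes true nulls (empty fields in the file). If data.csv contains the literal text null, as the listing in the question suggests, PigStorage loads it as the string 'null', so an extra comparison is needed. The alias col23_strict is a name I made up for illustration.

```pig
-- Using the question's aliases (input1, eid, grade):
-- project first, then DISTINCT the resulting two-field tuples.
input3 = DISTINCT (FOREACH input1 GENERATE eid, grade);

-- Using this answer's aliases (base, col2, col3):
-- drop true nulls AND the literal string 'null' (assumption: the file
-- may contain the text null rather than an empty field).
col23_strict = DISTINCT (FOREACH (FILTER base BY col3 IS NOT NULL AND col3 != 'null')
                         GENERATE col2, col3);
```

If the file really holds the text null, the base rows will also show null instead of an empty field in any later dump.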
-- Now join col23_rel back to base on col2. Each base row is repeated once
-- per distinct col3 value of its col2 group, which gives the desired replication.
jnd = JOIN base by col2, col23_rel by col2;
dump jnd;
(68,emp101,a1,emp101,a1)
(68,emp101,a1,emp101,a2)
(74,emp101,,emp101,a1)
(74,emp101,,emp101,a2)
(56,emp101,a1,emp101,a1)
(56,emp101,a1,emp101,a2)
(67,emp101,a2,emp101,a1)
(67,emp101,a2,emp101,a2)
(45,emp102,b1,emp102,b1)
(45,emp102,b1,emp102,b2)
(45,emp102,b1,emp102,b3)
(78,emp102,b2,emp102,b1)
(78,emp102,b2,emp102,b2)
(78,emp102,b2,emp102,b3)
(23,emp102,b3,emp102,b1)
(23,emp102,b3,emp102,b2)
(23,emp102,b3,emp102,b3)
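The joined relation still carries both copies of col2. To get the four-column layout from the question, a final projection along these lines should work. This is a sketch: the aliases result and sorted and the ORDER step are my additions, not part of the original answer.

```pig
-- Keep the three original columns and expose the joined distinct value as col4.
-- After a JOIN, same-named fields are disambiguated with relation::field.
result = FOREACH jnd GENERATE
    base::col1      AS col1,
    base::col2      AS col2,
    base::col3      AS col3,
    col23_rel::col3 AS col4;

-- Optional: keep each copy together, as in the expected output.
-- Row order inside each (col2, col4) group is not guaranteed.
sorted = ORDER result BY col2, col4;
dump sorted;
```

The ORDER step is only there to make the output easier to compare with the expected listing; the set of rows is the same without it.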