Grouping and counting records in Pig Latin

Date: 2016-05-13 16:04:20

Tags: hadoop apache-pig

I am new to Pig Latin and am trying to solve the following problem:

Find the number of employees who have a phone number in each area code.

EMPID   ADD_ID     ZIP    SAL   PHONE        DAT
Abcd411 PbcDr60264 953492 46404 111-432-4193 20150113
Abcd874 PbcDr39353 186307 29873 100-432-9164 20150728
Abcd197 PbcDr46725 306185 31908 113-432-4191 20150410
Abcd160 PbcDr77738 330533 61313 105-432-2468 20151007
Abcd327 PbcDr10034 951703 39301 109-432-9235 20150805
Abcd172 PbcDr21679 683299 71686 105-432-5616 20150908
Abcd227 PbcDr57694 876619 46743 109-432-9181 20151101
Abcd900 PbcDr80166 970136 34242 105-432-7415 20150820
Abcd318 PbcDr34711 234066 10989 101-432-9667 20150906
Abcd702 PbcDr86734 997954 97688 105-432-6592 20151026

Here is how I tried to solve it:

empdata = LOAD '/home/cloudera/empData.txt' as (empId:chararray, location:chararray, zipCode:long , salary:long, phone:chararray, dateOfJoin:long);
grpdata = GROUP empdata by SUBSTRING(phone, 0, INDEXOF(phone, '-' , 0));
dataCnt = foreach grpdata generate count(grpdata);

But I get this error: Invalid scalar projection: grpdata : A column needs to be projected from a relation for it to be used as a scalar

Here is another problem statement on the same dataset:

Find number of employees having date of joining between 2015-01-01 to 2015-05-28. 

I tried the following solution, but this time I get no results at all.

empdata = LOAD '/home/cloudera/empData.txt' as (empId:chararray, location:chararray, zipCode:long , salary:long, phone:chararray, doj:chararray);
filtDate = filter empdata by ToDate(doj, 'yyyyMMdd') >= ToDate('20150101', 'yyyymmdd') AND ToDate(doj, 'yyyyMMdd') <= ToDate('20150528', 'yyyymmdd');

Please help me understand what is going wrong.

2 answers:

Answer 0 (score: 1)

Try this:

empdata = LOAD '/home/cloudera/empData.txt' USING PigStorage(' ') AS (empId:chararray, location:chararray, zipCode:long , salary:long, phone:chararray, dateOfJoin:long);
grpdata = GROUP empdata by SUBSTRING(phone, 0, INDEXOF(phone, '-' , 0));
dataCnt = foreach grpdata generate $0, COUNT(empdata);
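
To see why counting the inner bag works, it can help to inspect the grouped relation. The following is a minimal sketch, assuming the file is delimited by single spaces and has no header row; the aliases areaCode and numEmployees are illustrative names that do not appear in the question.

```pig
-- Assumption: fields are separated by a single space and the file has no header line.
empdata = LOAD '/home/cloudera/empData.txt' USING PigStorage(' ')
          AS (empId:chararray, location:chararray, zipCode:long,
              salary:long, phone:chararray, dateOfJoin:long);

-- The grouping key is the text before the first '-' in the phone number.
grpdata = GROUP empdata BY SUBSTRING(phone, 0, INDEXOF(phone, '-', 0));

-- Each grpdata tuple has two fields: 'group' (the key) and a bag named 'empdata'.
DESCRIBE grpdata;

-- Count the tuples in the bag, not the grouped relation itself.
dataCnt = FOREACH grpdata GENERATE group AS areaCode, COUNT(empdata) AS numEmployees;
DUMP dataCnt;
```

For the ten sample rows in the question, area code 105 has 4 employees, 109 has 2, and 100, 101, 111 and 113 have 1 each.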

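Answer 1 (score: 0)

You should count empdata, not grpdata. After the GROUP, each tuple of grpdata contains the key and a bag named empdata, and COUNT has to be applied to that bag: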

dataCnt = foreach grpdata generate COUNT(empdata);
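
For the second problem in the question (join dates between 2015-01-01 and 2015-05-28), two things in the posted script look suspect. The LOAD has no USING clause, so Pig splits on tabs by default, and if the file is space-delimited every field after the first is likely to be null. Also, in the date pattern, lowercase 'mm' means minutes while uppercase 'MM' means month. The following is a sketch under those assumptions, not a verified fix; it also assumes the file has no header row, and the aliases allRows and dateCnt are illustrative.

```pig
-- Assumption: fields are separated by a single space and the file has no header line.
empdata = LOAD '/home/cloudera/empData.txt' USING PigStorage(' ')
          AS (empId:chararray, location:chararray, zipCode:long,
              salary:long, phone:chararray, doj:chararray);

-- Use 'MM' (month) in every pattern; 'mm' would be read as minutes.
filtDate = FILTER empdata BY
    ToDate(doj, 'yyyyMMdd') >= ToDate('20150101', 'yyyyMMdd') AND
    ToDate(doj, 'yyyyMMdd') <= ToDate('20150528', 'yyyyMMdd');

-- Put all remaining rows into a single group and count them.
allRows = GROUP filtDate ALL;
dateCnt = FOREACH allRows GENERATE COUNT(filtDate);
DUMP dateCnt;
```

Of the ten sample rows in the question, two have join dates in that range (20150113 and 20150410). Because the values are already in yyyyMMdd form, loading the column as long and comparing it with 20150101L and 20150528L would be a simpler alternative that avoids date parsing.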