I have a file whose lines look like this:
('www.example.com', 'FirstName LastName', '12345', 'Firstname', 'Lastname', '1967-05-16', 'Organization name')
Using Pig, I want to count how many times each "Organization name" occurs in the file, and output the result in the following format:
'Count Result','www.example.com', 'FirstName LastName', 'Organization name'
Here is what I have tried so far. I know I am missing something on the countOccurance line, but I can't figure out what:
data = LOAD 'data' AS (line:chararray);
data = FOREACH data GENERATE line, REPLACE(REPLACE(line, '\\(',''),'\\)','');
data = FOREACH data GENERATE STRSPLIT(line, '\\,') as entity;
grouped = GROUP data BY entity.$6;
countOccurance = FOREACH grouped GENERATE group as entity.$6,COUNT(data);
DUMP countOccurance;
Answer 0 (score: 1)
It has been a while since I did anything with Pig, but I think you can do it like this:
-- Load the comma-separated file; "..." stands for the remaining column names.
data = LOAD 'data' USING PigStorage(',') AS (URL:chararray, FULLNAME:chararray, ..., COMPANYNAME:chararray);
data = FOREACH (GROUP data BY COMPANYNAME) GENERATE COUNT(data.COMPANYNAME), data.URL, data.FULLNAME, data.COMPANYNAME;
DUMP data;
Of course, replace the ... with the other column names.
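If it helps to check the Pig output against a small sample, the same count can be sketched in plain Python. This is only a rough reference under stated assumptions, not a replacement for the Pig script: the sample lines, the name `count_by_organization`, and the idea that every line is a valid Python tuple literal are all assumptions made for illustration.

```python
import ast
from collections import Counter

# Hypothetical sample lines in the format shown in the question.
SAMPLE_LINES = [
    "('www.example.com', 'FirstName LastName', '12345', 'Firstname', 'Lastname', '1967-05-16', 'Organization name')",
    "('www.example.org', 'Jane Doe', '67890', 'Jane', 'Doe', '1970-01-01', 'Organization name')",
    "('www.example.net', 'John Roe', '13579', 'John', 'Roe', '1980-02-02', 'Other org')",
]


def count_by_organization(lines):
    """Return one (count, url, full_name, organization) row per input line."""
    # Assumption: each line is a well-formed tuple literal of quoted strings.
    records = [ast.literal_eval(line) for line in lines]
    # The organization name is the seventh field (index 6), as in entity.$6.
    counts = Counter(record[6] for record in records)
    return [(counts[record[6]], record[0], record[1], record[6]) for record in records]


if __name__ == "__main__":
    for row in count_by_organization(SAMPLE_LINES):
        print(row)
```

Each row carries the count for its organization followed by the URL, full name, and organization name, which is the layout the question asks for.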