Counting occurrences of a word in a column in Pig

Date: 2016-01-22 13:08:42

Tags: hadoop count apache-pig

I have a file whose lines look like this:

('www.example.com', 'FirstName LastName', '12345', 'Firstname', 'Lastname', '1967-05-16', 'Organization name')

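Using Pig, I want to count how many times the same 'Organization name' occurs in the file and output the result in the following format: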

'Count Result','www.example.com', 'FirstName LastName', 'Organization name'

Here is what I have tried so far. I know I am missing something on the countOccurance line, but I can't figure out what:

data = LOAD 'data' AS (line:chararray);
data = FOREACH data GENERATE line, REPLACE(REPLACE(line, '\\(',''),'\\)','');
data = FOREACH data GENERATE STRSPLIT(line, '\\,') as entity;
grouped = GROUP data BY entity.$6;
countOccurance = FOREACH grouped GENERATE group as entity.$6,COUNT(data);
DUMP countOccurance;

1 Answer:

Answer 0 (score: 1)

It has been a while since I did anything with Pig, but I think you can do it like this:

data = LOAD 'data' USING PigStorage(',') AS (URL:chararray, FULLNAME:chararray, ..., COMPANYNAME:chararray);
data = FOREACH (GROUP data BY COMPANYNAME) GENERATE COUNT(data.COMPANYNAME), data.URL, data.FULLNAME, data.COMPANYNAME;
DUMP data;

Of course, replace the ... with the other column names.
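
For reference, here is a fuller sketch of the same idea, written against the sample line in the question. Two things in the original attempt look likely to cause trouble: `group as entity.$6` uses a dotted expression where AS expects a plain alias, and the STRSPLIT step reads `line` rather than the REPLACE result. The relation and field names below (rows, cleaned, record_id, cnt and so on) are made up for illustration, and the script assumes every line has exactly seven fields with no commas inside any value. The quotes around each value are kept because the desired output shows them. Treat it as a starting point, not a tested solution.

```pig
-- Assumption: the input file is named 'data' and each line has exactly
-- seven comma-separated fields, none of which contains a comma itself.
rows = LOAD 'data' USING PigStorage(',') AS (
    url:chararray, fullname:chararray, record_id:chararray,
    firstname:chararray, lastname:chararray,
    birthdate:chararray, company:chararray);

-- Drop the opening "(" from the first field and the closing ")" from the
-- last one, and trim the space that follows each comma.
cleaned = FOREACH rows GENERATE
    TRIM(REPLACE(url, '^\\(', ''))          AS url,
    TRIM(fullname)                          AS fullname,
    TRIM(REPLACE(company, '\\)\\s*$', ''))  AS company;

-- One group per distinct organization name.
by_company = GROUP cleaned BY company;

-- COUNT gives the number of rows in each group; FLATTEN turns the bag of
-- (url, fullname) pairs back into one output row per original line.
counted = FOREACH by_company GENERATE
    COUNT(cleaned)                    AS cnt,
    FLATTEN(cleaned.(url, fullname)),
    group                             AS company;

DUMP counted;
```

Without the FLATTEN, each output row would carry the URLs and names as bags, which is what the shorter version above produces. Flattening repeats the count on every row, which seems closer to the format asked for in the question.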