How to count characters

Time: 2017-04-10 01:18:24

Tags: hadoop

I have a few text files and I want to count characters in them, but not all characters: I only need to count how many times the letters a, b and c occur in those files. I'm very new to Pig. Any help would be appreciated. Thanks!

1 answer:

Answer 0 (score: 0)

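Use the wildcard * to load all the files into a single field of type chararray. Split each line into words, split the words into individual letters, then filter for a, b and c and count them.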

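-- Load every matching file; TextLoader keeps each whole line as one chararray
-- (the default PigStorage loader would split lines on tabs)
A = LOAD '/path/text*.txt' USING TextLoader() AS (lines:chararray);
-- Split each line into words
B = FOREACH A GENERATE FLATTEN(TOKENIZE(lines)) AS words;
-- Insert '|' around every character, then split on '|' to get one letter per row
C = FOREACH B GENERATE FLATTEN(TOKENIZE(REPLACE(words, '', '|'), '|')) AS letters;
-- Keep only the letters a, b and c (case-sensitive)
D = FILTER C BY letters matches '[abc]';
-- Group identical letters and count each group
E = GROUP D BY letters;
F = FOREACH E GENERATE group AS letter, COUNT(D) AS total;
DUMP F;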
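
To sanity-check the Pig output on a small sample, the same counting logic can be sketched in plain Python. This is not part of the original answer: the function name `count_abc` and the sample lines are hypothetical, and the sketch assumes the same case-sensitive matching as the script above.

```python
from collections import Counter


def count_abc(lines, letters="abc"):
    """Count how often each character in `letters` occurs across `lines`."""
    wanted = set(letters)
    counts = Counter()
    for line in lines:
        # Keep only the characters of interest, like the FILTER step in Pig.
        counts.update(ch for ch in line if ch in wanted)
    return counts


if __name__ == "__main__":
    # Hypothetical sample data standing in for the contents of the text files.
    sample = ["abc cab", "banana"]
    for letter, total in sorted(count_abc(sample).items()):
        print(letter, total)
```

Running the same small sample through both the Pig script and this function should give matching per-letter totals; a mismatch usually points at how the input was loaded or tokenized.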