Pig script for counting characters

Time: 2016-04-29 04:33:57

Tags: regex count apache-pig

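I am trying to write a Pig script that counts all characters (special characters as well as letters) and reports a separate count for each one. I have been using the script below, but it only counts letters and leaves out special characters such as ? and :. Please help!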

A = load 'pigfiles/p.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into 'pigfiles/wordcount';

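3 answers:

Answer 0: (score: 0)

Just use '(.+)' instead of '\\w+' in the filter, and you will get counts for the punctuation marks as well as the letters in the file.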

Example:

File: [cat a.txt]

"HI"
Lets try using some punctuations!? How? Why!?
Lets, just; do this!!

Code:

A = load 'a.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '(.+)';
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into 'pigfiles/wordcount';

Output: cat part-r-00000

4       !
1       ;
3       ?
2       H
1       I
2       L
1       W
1       a
1       c
1       d
3       e
1       g
2       h
3       i
1       j
1       m
3       n
4       o
1       p
1       r
7       s
7       t
4       u
1       w
2       y
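
Why the filter pattern matters can be sketched outside Pig. The Python below is only a rough model, not Pig code. It assumes that Pig's `matches` must match the whole string (as `re.fullmatch` does) and that TOKENIZE splits on space, double quote, comma, parentheses and asterisk. The helper names `tokenize` and `keep` are made up for illustration.

```python
import re

# Assumed stand-in for Pig's default TOKENIZE delimiters:
# space, double quote, comma, parentheses, asterisk.
DELIMS = r'[ ",()*]+'

def tokenize(line):
    """Split a line on the assumed delimiters and drop empty tokens."""
    return [tok for tok in re.split(DELIMS, line) if tok]

def keep(tokens, pattern):
    """Keep tokens that the pattern matches in full, like Pig's `matches`."""
    return [tok for tok in tokens if re.fullmatch(pattern, tok)]

tokens = tokenize("Lets try using some punctuations!? How? Why!?")

# '\w+' rejects every token that contains punctuation, letters included.
word_only = keep(tokens, r"\w+")

# '(.+)' accepts every non-empty token, so nothing is filtered out.
any_chars = keep(tokens, r"(.+)")

print(word_only)
print(any_chars)
```

Under these assumptions, a token such as `How?` is discarded whole by '\\w+', which is why its letters disappear from the count along with the question mark.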

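Answer 1: (score: 0)

The reason you are missing some special characters is that TOKENIZE uses space, double quote ("), comma (,), parentheses (()) and asterisk (*) as delimiters.

So when you apply TOKENIZE to (chararray)$0, the delimiter characters are discarded and never counted.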
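
The effect can be checked with a small Python model (a sketch, not Pig itself). It assumes the delimiter set listed above and compares character counts in the raw sample lines with the counts that survive tokenization. The names `raw`, `after` and `lost` are illustrative.

```python
import re
from collections import Counter

SAMPLE = [
    '"HI"',
    "Lets try using some punctuations!? How? Why!?",
    "Lets, just; do this!!",
]

# Assumed TOKENIZE delimiters: space, double quote, comma, parentheses, asterisk.
DELIMS = r'[ ",()*]+'

# Counts over the untouched lines.
raw = Counter(ch for line in SAMPLE for ch in line)

# Counts over the characters that remain after splitting on the delimiters.
after = Counter(
    ch
    for line in SAMPLE
    for token in re.split(DELIMS, line)
    for ch in token
)

# Characters present in the input that never reach the counting step.
lost = sorted(ch for ch in raw if ch not in after)
print(lost)
```

In this model the space, the double quote and the comma are the characters that vanish, while `!`, `?` and `;` survive because they are not delimiters.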

So, using Ani Menon's sample data, here are the script and its output.

Input

"HI"
Lets try using some punctuations!? How? Why!?
Lets, just; do this!!

PigScript

A = LOAD 'test5.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE(REPLACE((chararray)$0,'','|'), '|')) AS letter;
C = FILTER B  BY letter != ' ';
D = GROUP C BY letter;
E = FOREACH D GENERATE COUNT(C.letter), group;
DUMP E;

Output

[image: Output]
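
The REPLACE-then-TOKENIZE step in the script above can be modeled in Python as a rough sketch. It assumes that REPLACE with an empty pattern inserts the separator around every character (as Java's `replaceAll("", "|")` does, and as Python's `str.replace` does here) and that empty tokens are dropped. `split_chars` is a hypothetical helper, not a Pig function.

```python
def split_chars(line, sep="|"):
    """Model REPLACE(line, '', sep) followed by TOKENIZE(..., sep)."""
    # "ab".replace("", "|") gives "|a|b|": a separator around each character.
    marked = line.replace("", sep)
    # Splitting on the separator yields one piece per character; drop empties.
    return [piece for piece in marked.split(sep) if piece]

chars = split_chars("do this!!")

# Same idea as FILTER B BY letter != ' ' in the script.
non_space = [c for c in chars if c != " "]

# A literal '|' in the input is indistinguishable from the separator.
with_pipe = split_chars("a|b")

print(chars)
print(non_space)
print(with_pipe)
```

One consequence of this design, at least in the model, is that any `|` already present in the input is treated as a separator and is not counted, so a different separator would be needed for data that contains it.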

Answer 2: (score: 0)

Here is one solution:

lines = LOAD 'p.txt' AS (line: chararray);

characters = FOREACH lines GENERATE FLATTEN(STRSPLITTOBAG(line, '')) AS character;

charGroups = GROUP characters BY character;

result = FOREACH charGroups GENERATE group, COUNT($1);

store result into 'charcount.txt';

It produces output like the following:

[image: output screenshot]