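I'm trying to write a Pig script that counts every character (special characters as well as letters) and reports a separate count for each one. I've been using the script below, but it only counts letters and leaves out special characters such as ? and :. Please help!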
A = load 'pigfiles/p.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into 'pigfiles/wordcount';
Answer 0 (score: 0)
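Just use '(.+)' instead of '\\w+', and the script will count punctuation as well as letters. Characters that TOKENIZE itself treats as delimiters, such as the double quotes and the comma in the sample file below, still do not appear in the output.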
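Pig's `matches` operator has to match the whole string, so a `'\\w+'` filter throws away every token that contains punctuation, letters included, rather than just skipping the punctuation. The following is a minimal sketch in Python (not Pig), using `re.fullmatch` as a stand-in for `matches`. The token list and the `keep` helper are hypothetical and only illustrate the idea:

```python
import re

# Hypothetical tokens, roughly what tokenizing one sample line could yield.
tokens = ["Lets", "try", "punctuations!?", "How?", "Why!?"]

def keep(tokens, pattern):
    """Keep tokens the pattern matches in full (stand-in for Pig's `matches`)."""
    return [t for t in tokens if re.fullmatch(pattern, t)]

word_only = keep(tokens, r"\w+")  # drops any token that contains punctuation
anything = keep(tokens, r".+")    # keeps every non-empty token

print(word_only)
print(anything)
```

With `\w+`, the letters in "Why!?" are lost along with the punctuation, which is why switching to `.+` changes the letter counts as well.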
Example:
File (cat a.txt):
"HI"
Lets try using some punctuations!? How? Why!?
Lets, just; do this!!
Code:
A = load 'a.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '(.+)';
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into 'pigfiles/wordcount';
Output (cat part-r-00000):
4 !
1 ;
3 ?
2 H
1 I
2 L
1 W
1 a
1 c
1 d
3 e
1 g
2 h
3 i
1 j
1 m
3 n
4 o
1 p
1 r
7 s
7 t
4 u
1 w
2 y
Answer 1 (score: 0)
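The reason you are missing some special characters is that TOKENIZE uses space, double quote ("), comma (,), parentheses (()) and asterisk (*) as delimiters.
So when you apply TOKENIZE to (chararray)$0, those delimiter characters are discarded and never counted.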
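The effect can be checked outside Pig. The following is a rough Python emulation, not Pig itself, and it assumes the delimiter set is exactly the one listed above. The helper names `tokenize_then_split` and `split_directly` are made up for this sketch:

```python
import re
from collections import Counter

# Assumption: the default delimiters listed above
# (space, double quote, comma, parentheses, asterisk).
DELIMS = r'[ ",()*]+'

def tokenize_then_split(line):
    """Tokenize on the delimiters first, then count characters in each token."""
    tokens = [t for t in re.split(DELIMS, line) if t]
    return Counter(ch for t in tokens for ch in t)

def split_directly(line):
    """Count every character of the raw line, dropping only spaces."""
    return Counter(ch for ch in line if ch != " ")

lines = ['"HI"', 'Lets, just; do this!!']
lost = Counter()
for line in lines:
    # Counter subtraction keeps only the characters the tokenized path misses.
    lost += split_directly(line) - tokenize_then_split(line)

print(dict(lost))
```

Only the double quotes and the comma differ between the two paths. Characters that are not delimiters, such as ; and !, are counted either way.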
So, using Ani Menon's sample data, here are the script and its output.
Input:
"HI"
Lets try using some punctuations!? How? Why!?
Lets, just; do this!!
Pig script:
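A = LOAD 'test5.txt';
-- Insert '|' around every character, then split on '|' so each character becomes its own token.
B = FOREACH A GENERATE FLATTEN(TOKENIZE(REPLACE((chararray)$0,'','|'), '|')) AS letter;
-- Drop spaces so that only visible characters are counted.
C = FILTER B BY letter != ' ';
D = GROUP C BY letter;
-- Emit (count, character) pairs.
E = FOREACH D GENERATE COUNT(C.letter), group;
DUMP E;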
Output:
Answer 2 (score: 0)