How do I get the number of words on each line in Pig?

Time: 2014-11-17 02:56:01

Tags: apache-pig

I am trying to find out how many words there are on each line of a file in Pig. I have loaded the file and split it into words:

raw = load file;
words = FOREACH raw GENERATE TOKENIZE(*);

This gives me, for each line, a bag of tuples that each contain one word. When I then try to count those items, I get an error:

counts = FOREACH words GENERATE COUNT(*);

The error I receive is:

org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing count in COUNT
...
Caused by: java.lang.NullPointerException

Is this because some lines have empty bags? Or is there something else I am doing wrong?

2 answers:

Answer 0: (score: 0)

If the problem is empty bags, you could try something like this (untested):

raw = load file;

words = FOREACH raw GENERATE TOKENIZE(*) as tokenized_words;

counts = FOREACH words GENERATE ((tokenized_words IS NULL OR IsEmpty(tokenized_words)) ? 0L : COUNT(tokenized_words)) AS total_count;

Here we write an if-else condition (a bincond) that checks whether tokenized_words is null or empty. If it is, we assign zero; otherwise we assign the total count.

Answer 1: (score: 0)

Can you try it this way?

Input:

Hi hello how are you
this is apache pig
works

like a charm

Pig script:

A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE TOKENIZE(line);
C = FOREACH B GENERATE COUNT($0);
DUMP C;

Output:

(5)
(4)
(1)
()
(3)
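
The `()` in this output is a null count for the blank input line. If a 0 is preferred there, one possible variant is sketched below. It is untested and assumes that the blank line loads as a null `line`, so that TOKENIZE returns a null bag; the alias names `words` and `word_count` are made up for illustration.

```pig
-- Sketch only: same pipeline as the script above, with a null bag mapped to 0.
-- Assumption: a blank input line yields a null bag from TOKENIZE.
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE TOKENIZE(line) AS words;
-- COUNT returns a long, so the zero branch uses a long literal (0L).
C = FOREACH B GENERATE (words IS NULL ? 0L : COUNT(words)) AS word_count;
DUMP C;
```

Naming the bag with AS and passing it to COUNT explicitly, rather than using `COUNT(*)`, makes it clear which field is being counted.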