Question

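Below is my code sample. I am trying to run a word count demo on the Old Testament. When I run this code through Amazon EMR, the step fails. I uploaded the code to EMR as a plain text file, and all of my paths are correct.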

Here is my code:

a = load 's3://joe-hadoop-first-try/oldtest/oldtest.txt' as (f1:chararray);
b = foreach a generate FLATTEN(TOKENIZE(f1)) as word;
c = group b by word;
d = FOREACH c GENERATE COUNT(b), group;
store d into 's3://joe-hadoop-first-try/wordcountoutput';

Here is the error output:

3904 [main] ERROR org.apache.pig.PigServer  - exception during parsing: Error during parsing. <file s3://joe-hadoop-first-try/input/wordcountoldtest.txt, line 2, column 52>  mismatched input '$0' expecting RIGHT_PAREN

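The text file is the Old Testament in plain text format. Here is a sample from the beginning: The Project Gutenberg EBook of The King James Bible This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org Title: King James Bible Release Date: March 2, 2011 [EBook #10] [This King James Bible was originally posted by Project Gutenberg in late 1989]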

Also, this error still occurs when a text file containing only

helloworld

is used as input.

Here is the attempted solution using a schema:

a = load 's3://joe-hadoop-first-try/oldtest/oldtest.txt' as (f1:chararray);
b = foreach a generate FLATTEN(TOKENIZE(f1)) as word;
c = group b by word;
d = FOREACH c GENERATE COUNT(b), group;
store d into 's3://joe-hadoop-first-try/wordcountoutput';

This code now runs correctly! All of the errors are fixed.

Answer 1

Resolving the question from the comments.

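You can load the data with a schema instead of casting inline.
Pig function names are case-sensitive, so built-in functions such as TOKENIZE and COUNT only work when written in uppercase.
Make sure your input and output paths point to the correct locations.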
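The first two points can be checked with a small sketch like the one below. The input path `input.txt` is a hypothetical local file, not one from the question, and the commented-out line is only there to show the kind of statement that would be expected to fail.

```pig
-- Hypothetical input path; replace it with your own file.
a = LOAD 'input.txt' AS (f1:chararray);   -- schema declared at load time, no inline cast needed
DESCRIBE a;                               -- prints the schema Pig has recorded for a

-- Keywords (LOAD/load, FOREACH/foreach, ...) are case-insensitive,
-- but function names are case-sensitive:
b = FOREACH a GENERATE FLATTEN(TOKENIZE(f1)) AS word;      -- built-in name in uppercase: resolves
-- b = FOREACH a GENERATE FLATTEN(tokenize(f1)) AS word;   -- lowercase name: not resolved as the built-in
```

Running DESCRIBE before the later statements is a cheap way to confirm that the field name and type are what the rest of the script assumes.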

So the complete code should look like this:

a = LOAD 's3://joe-hadoop-first-try/oldtest/oldtest.txt' AS (f1:chararray);
b = FOREACH a GENERATE FLATTEN(TOKENIZE(f1)) AS word;
c = GROUP b BY word;
d = FOREACH c GENERATE COUNT(b), group;
STORE d INTO 's3://joe-hadoop-first-try/wordcountoutput';
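As an optional extension that is not part of the original solution, the counts can be given field names and sorted so that the most frequent words come first. The aliases `e` and `f`, the field name `cnt`, and the cutoff of 10 are illustrative choices.

```pig
a = LOAD 's3://joe-hadoop-first-try/oldtest/oldtest.txt' AS (f1:chararray);
b = FOREACH a GENERATE FLATTEN(TOKENIZE(f1)) AS word;
c = GROUP b BY word;
-- Name both output fields so later statements can refer to them.
d = FOREACH c GENERATE group AS word, COUNT(b) AS cnt;
-- Sort by frequency, most common word first.
e = ORDER d BY cnt DESC;
-- Keep the ten most frequent words (the cutoff is arbitrary).
f = LIMIT e 10;
-- Print the result instead of storing it; swap in STORE for a real run.
DUMP f;
```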

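In addition, by default the LOAD operation splits each line into fields on tabs. If that is not the behavior you want, you can change the delimiter with PigStorage, for example to a newline: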

a = LOAD 's3://joe-hadoop-first-try/oldtest/oldtest.txt' USING PigStorage('\n') AS (f1:chararray);
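If the goal is simply to get each line as a single chararray field, Pig's built-in TextLoader is another option worth trying. This is a sketch of an alternative, not what the original answer used.

```pig
-- TextLoader reads each input line into a tuple with one chararray field,
-- so tabs inside a line do not split it into several fields.
a = LOAD 's3://joe-hadoop-first-try/oldtest/oldtest.txt' USING TextLoader() AS (f1:chararray);
b = FOREACH a GENERATE FLATTEN(TOKENIZE(f1)) AS word;
```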

Getting an error when running a Pig word count script on Amazon EMR
