ERROR 2017 in a simple Pig script

Time: 2014-06-13 15:26:55

Tags: hadoop apache-pig

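Here is my entire script. It is supposed to look through a Project Gutenberg etext and strip out the header and footer text, leaving only the actual text of the book, which can then be used for further analysis.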

ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;

header = FILTER ranked BY SUBSTRING(line,0,41)=='*** START OF THIS PROJECT GUTENBERG EBOOK';
hlines = FOREACH header GENERATE $0;
headers = RANK hlines;
--STORE headers INTO '/user/PHIBBS/headers' USING PigStorage;

footer = FILTER ranked BY SUBSTRING(line,0,39)=='*** END OF THIS PROJECT GUTENBERG EBOOK';
flines = FOREACH footer GENERATE $0;
footers = RANK flines;
--STORE footers INTO '/user/PHIBBS/footers' USING PigStorage;

blocks =  JOIN headers BY $0, footers BY $0;
sectioned = CROSS blocks, ranked;
--STORE sectioned INTO '/user/PHIBBS/sectioned';

book = FILTER sectioned BY $4 > $1 AND $4 < $3;
STORE book INTO '/user/PHIBBS/clean/$ebook';

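It fails with "ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error creating job configuration."

If I run only a subset of the script, everything works until the last line. If I run the first 5 lines plus the commented-out STORE line, that works. If I then run the next 3 lines plus the next commented-out STORE line, it blows up. If I disable EITHER of the STORE lines, it works. So each STORE statement is fine on its own. Both of them together? ERROR 2017! Any suggestions? I have tried two different distributions, one from Hortonworks and one from Cloudera, each on a clean virtual machine image freshly downloaded from the vendor's website.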

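2 answers:

Answer 0: (score: 0)

Given that your goal is to remove the header and footer and keep only the book, you do not need to store anything other than the book and the header/footer. I think your problem is the statement blocks = JOIN headers BY $0, footers BY $0;, which performs a self-join on data that was loaded only once. I downloaded War and Peace, and this code worked for me.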

$ pig -x local
# grunt>

ebook = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
ranked = RANK ebook;

header = FILTER ranked BY SUBSTRING(line, 0, 41) == '*** START OF THIS PROJECT GUTENBERG EBOOK';
hlines = FOREACH header GENERATE $0;
headers = RANK hlines;
STORE headers INTO 'headers' USING PigStorage();

footer = FILTER ranked BY SUBSTRING(line, 0, 39) == '*** END OF THIS PROJECT GUTENBERG EBOOK';
flines = FOREACH footer GENERATE $0;
footers = RANK flines;
STORE footers INTO 'footers' USING PigStorage();

/* Now re-load headers and footers for join */

h_new = LOAD 'headers/part-m-00000' USING PigStorage() AS (id:int, col1:int);
f_new = LOAD 'footers/part-m-00000' USING PigStorage() AS (id:int, col1:int);

blocks = JOIN h_new BY id, f_new BY id;
sectioned = CROSS blocks, ranked;
book = FILTER sectioned BY $4 > $1 AND $4 < $3;
STORE book INTO '__book__';

Answer 1: (score: 0)

It should also work if you read the original input into two different variables.

ebook_header = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
ebook_footer = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);

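Then apply the corresponding filter to each of them. I guess it comes down to which is preferable: reading the input once, writing two outputs and reading those back in, or reading the input twice.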
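To make the two-load idea concrete, here is a minimal, untested sketch of how the rest of the question's script might look with it. The relation names ranked_header and ranked_footer and the output path 'book_two_loads' are made up for illustration. It also assumes that RANK numbers the lines identically in both loads of the same file, and whether this actually avoids ERROR 2017 may depend on the Pig version.

```pig
-- Load the same file twice so the header and footer branches have separate inputs.
ebook_header = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
ebook_footer = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);

-- Number the lines in each copy (assumed to yield the same numbering for both).
ranked_header = RANK ebook_header;
ranked_footer = RANK ebook_footer;

-- Line number of the header marker, taken from the first copy.
header = FILTER ranked_header BY SUBSTRING(line, 0, 41) == '*** START OF THIS PROJECT GUTENBERG EBOOK';
hlines = FOREACH header GENERATE $0;
headers = RANK hlines;

-- Line number of the footer marker, taken from the second copy.
footer = FILTER ranked_footer BY SUBSTRING(line, 0, 39) == '*** END OF THIS PROJECT GUTENBERG EBOOK';
flines = FOREACH footer GENERATE $0;
footers = RANK flines;

-- Pair each header with its footer, then keep only the lines strictly between them.
blocks = JOIN headers BY $0, footers BY $0;
sectioned = CROSS blocks, ranked_header;
book = FILTER sectioned BY $4 > $1 AND $4 < $3;
STORE book INTO 'book_two_loads';
```

If the CROSS against ranked_header still triggers the error, a third LOAD of the same file for the body lines would be the analogous thing to try.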