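Here is my entire script. It is supposed to take a Project Gutenberg etext and strip the header and footer text, leaving only the actual text of the book, which can then be used for further analysis.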
ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;
header = FILTER ranked BY SUBSTRING(line,0,41)=='*** START OF THIS PROJECT GUTENBERG EBOOK';
hlines = FOREACH header GENERATE $0;
headers = RANK hlines;
--STORE headers INTO '/user/PHIBBS/headers' USING PigStorage;
footer = FILTER ranked BY SUBSTRING(line,0,39)=='*** END OF THIS PROJECT GUTENBERG EBOOK';
flines = FOREACH footer GENERATE $0;
footers = RANK flines;
--STORE footers INTO '/user/PHIBBS/footers' USING PigStorage;
blocks = JOIN headers BY $0, footers BY $0;
sectioned = CROSS blocks, ranked;
--STORE sectioned INTO '/user/PHIBBS/sectioned';
book = FILTER sectioned BY $4 > $1 AND $4 < $3;
STORE book INTO '/user/PHIBBS/clean/$ebook';
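It fails with "ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error creating job configuration."
If I run only a subset of the script, it works right up until the last line. If I run the first 5 lines plus the commented-out STORE line, that works. If I run the next 3 lines plus the next commented-out STORE line, it crashes. If I disable EITHER of the STORE lines, it works. So each STORE statement is fine on its own. Both of them together? ERROR 2017! Any suggestions? I tried two different distributions, one from Hortonworks and one from Cloudera, using clean virtual machine images freshly downloaded from their respective websites.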
Answer 0 (score: 0)
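Given that your goal is to remove the header/footer and keep only the book, you don't need to store anything other than the book and the header/footer. I think your problem is
blocks = JOIN headers BY $0, footers BY $0;
which is a self-join on data that was loaded only once. I downloaded War and Peace, and this code worked for me.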
$ pig -x local
# grunt>
ebook = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
ranked = RANK ebook;
header = FILTER ranked BY SUBSTRING(line, 0, 41) == '*** START OF THIS PROJECT GUTENBERG EBOOK';
hlines = FOREACH header GENERATE $0;
headers = RANK hlines;
STORE headers INTO 'headers' USING PigStorage();
footer = FILTER ranked BY SUBSTRING(line, 0, 39) == '*** END OF THIS PROJECT GUTENBERG EBOOK';
flines = FOREACH footer GENERATE $0;
footers = RANK flines;
STORE footers INTO 'footers' USING PigStorage();
/* Now re-load headers and footers for join */
h_new = LOAD 'headers/part-m-00000' USING PigStorage() AS (id:int, col1:int);
f_new = LOAD 'footers/part-m-00000' USING PigStorage() AS (id:int, col1:int);
blocks = JOIN h_new BY id, f_new BY id;
sectioned = CROSS blocks, ranked;
book = FILTER sectioned BY $4 > $1 AND $4 < $3;
STORE book INTO '__book__';
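The positional references in the final FILTER ($1, $3, $4) are hard to read at a glance. As a rough, untested sketch, the last steps could be written with named fields instead. It assumes the relation ranked from the script above is still defined and that the 'headers' and 'footers' outputs have already been stored. The names h_named, f_named, numbered, paired, bounds, inside, body, start_line, end_line, line_no, text and the output path '__book_text__' are made up for illustration.

```pig
-- Re-load the stored marker positions with descriptive field names.
h_named = LOAD 'headers/part-m-00000' USING PigStorage() AS (id:long, start_line:long);
f_named = LOAD 'footers/part-m-00000' USING PigStorage() AS (id:long, end_line:long);

-- Name the two columns produced by RANK: the row number and the original line.
numbered = FOREACH ranked GENERATE $0 AS line_no, $1 AS text;

-- Pair each header with the footer that has the same sequence number.
paired = JOIN h_named BY id, f_named BY id;
bounds = FOREACH paired GENERATE h_named::start_line AS start_line, f_named::end_line AS end_line;

-- Attach the bounds to every line, then keep the lines strictly between the markers.
sectioned = CROSS bounds, numbered;
inside = FILTER sectioned BY numbered::line_no > bounds::start_line AND numbered::line_no < bounds::end_line;

-- Store only the text, without the bookkeeping columns.
body = FOREACH inside GENERATE numbered::text;
STORE body INTO '__book_text__';
```

Projecting down to the text column at the end also keeps the row numbers and marker positions out of the stored output, which is probably what further analysis wants.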
Answer 1 (score: 0)
It should also work if you read the original input into two different variables.
ebook_header = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
ebook_footer = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
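and apply the corresponding filter to each of them. I guess it comes down to whether it is better to read the file once, create two outputs and read those back in, or to read the file twice.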
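Spelled out, the two-load variant might look roughly like the sketch below. It has not been run, so whether it actually avoids ERROR 2017 is an assumption. The relation names ranked_header and ranked_footer are made up for illustration; the remaining statements follow the script from the question, with the final CROSS taken against ranked_header since either ranked relation carries the full text.

```pig
-- Two independent loads of the same file, one per boundary marker.
ebook_header = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
ebook_footer = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);

-- RANK with no BY clause prepends a sequential row number to each line.
ranked_header = RANK ebook_header;
ranked_footer = RANK ebook_footer;

-- Find the marker lines, each in its own copy of the data.
header = FILTER ranked_header BY SUBSTRING(line, 0, 41) == '*** START OF THIS PROJECT GUTENBERG EBOOK';
footer = FILTER ranked_footer BY SUBSTRING(line, 0, 39) == '*** END OF THIS PROJECT GUTENBERG EBOOK';

-- Keep only the row numbers of the markers and number them for the join.
hlines = FOREACH header GENERATE $0;
headers = RANK hlines;
flines = FOREACH footer GENERATE $0;
footers = RANK flines;

-- blocks: ($0 header seq, $1 header row, $2 footer seq, $3 footer row)
blocks = JOIN headers BY $0, footers BY $0;

-- After the CROSS, $4 is the row number of each line and $5 is its text.
sectioned = CROSS blocks, ranked_header;
book = FILTER sectioned BY $4 > $1 AND $4 < $3;
STORE book INTO '__book__';
```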