Projecting the filename field after loading with the PigStorage '-tagFile' option (Pig 0.14)

Time: 2016-04-12 19:21:58

Tags: hadoop hive apache-pig emr elastic-map-reduce

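Amazon EMR-4.5, Hadoop 2.7.2, Pig 0.14

After loading with the -tagFile option, I want to project the filename field and a selection of the other fields into a new relation. The results don't seem to make sense. Example:

tagfile-test.txt (tab-delimited)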

AAA    123    2016
BBB    456    2016
CCC    789    2016

Dump after load

test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);
DUMP test;

(tagfile-test.txt,AAA,123,2016)
(tagfile-test.txt,BBB,456,2016)
(tagfile-test.txt,CCC,789,2016)

Correct - generate f0, f1, f2

test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);
project = FOREACH test GENERATE f0, f1, f2;
DUMP project;

(tagfile-test.txt,AAA,123)
(tagfile-test.txt,BBB,456)
(tagfile-test.txt,CCC,789)

Incorrect - generate f0, f1, f3 (the result is the same as above, with f2 returned in place of f3)

test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);
project = FOREACH test GENERATE f0, f1, f3;
DUMP project;

(tagfile-test.txt,AAA,123)
(tagfile-test.txt,BBB,456)
(tagfile-test.txt,CCC,789)

Incorrect - generate f0, f2, f3 (confirms the problem: f1 and f3 are returned)

test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);
project = FOREACH test GENERATE f0, f2, f3;
DUMP project;

(tagfile-test.txt,AAA,2016)
(tagfile-test.txt,BBB,2016)
(tagfile-test.txt,CCC,2016)

It looks like Pig is not resolving the field names correctly. I also tried field positions ($0, $1, $2, $3) and got the same results.
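
For reference, the positional form of the failing projection would look roughly like the following. This is only a sketch: the question does not show the exact statements that were tried, so the choice of $0, $1, $3 mirrors the second example above and is an assumption.

```pig
-- Same load as in the examples above; $0 is the file name added by -tagFile.
test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);

-- Positional equivalent of GENERATE f0, f1, f3 (assumed form, not quoted from the question).
project = FOREACH test GENERATE $0, $1, $3;
DUMP project;
```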

2 answers:

Answer 0: (score: 1)

I ran into the same problem when using the tagFile option with PigStorage, and resolved it by adding the following line to the Pig script:

set pig.optimizer.rules.disabled 'ColumnMapKeyPrune';

ColumnMapKeyPrune is explained well in the page linked from the original answer.
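
A minimal sketch of how that setting might fit into the script from the question, assuming the same file name and untyped schema. ColumnMapKeyPrune is the optimizer rule that drops columns and map keys the script does not use, so disabling it means Pig no longer prunes columns at load time. The sketch has not been run against the question's cluster, so it shows no expected output.

```pig
-- Disable the column/map-key pruning optimizer rule for this script.
set pig.optimizer.rules.disabled 'ColumnMapKeyPrune';

-- Same load as in the question; f0 is the file name added by -tagFile.
test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);

-- The projection that previously came back with the wrong column.
project = FOREACH test GENERATE f0, f1, f3;
DUMP project;
```

Because pruning normally reduces the data read from wide inputs, it is probably worth limiting the setting to scripts that actually use -tagFile.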

Answer 1: (score: 0)

It looks like the fields are separated by ',' but you are using '\t' as the delimiter in PigStorage. Also specify the data types of the fields.

Try changing this

test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);

to

test = LOAD 'tagfile-test.txt' USING PigStorage(',','-tagFile') AS (f0:chararray, f1:chararray, f2:int, f3:int);