Question

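I'm seeing some interesting behavior with PigStorage and its -tagPath option, and I can't tell whether I'm doing something wrong (a bad schema definition?) or whether this is a limitation or bug in Pig.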

My file looks like this (the most basic example I could come up with):

A
B

I can load this file and select from it like this:

vals = LOAD '/user/guest/test.txt'
    USING PigStorage(';') AS (char: chararray);

DUMP vals;

one_column = FOREACH vals GENERATE char;

DUMP one_column;

Result:

(A)
(B)
(A)
(B)

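However, when I try to get the file path with -tagPath (which I need because I read a whole folder of data), the data loads correctly into the first relation, but I can no longer select a single column from it: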

vals = LOAD '/user/guest/test.txt'
    USING PigStorage(';', '-tagPath')
    AS (filepath: chararray, char: chararray);

DUMP vals;

one_column = FOREACH vals GENERATE char;

DUMP one_column;

Result:

(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,A)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,B)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt)

However, when I first read the data without a schema and then add the schema with a FOREACH, it works correctly again:

vals = LOAD '/user/guest/test.txt'
    USING PigStorage(';', '-tagPath');

vals_n = FOREACH vals GENERATE (chararray)$0 AS filepath, (chararray)$1 AS char;

DUMP vals_n;

one_column = FOREACH vals_n GENERATE char;

DUMP one_column;

Result:

(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,A)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,B)
(A)
(B)

So is there a way to use -tagPath and a schema together in the LOAD step?

Answer 1

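This happens because Pig tries to work out automatically which columns the script actually uses and loads only those columns. That column pruning appears to get confused when -tagFile or -tagPath is used.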

The solution is to run the Pig script with this column detection turned off:

pig -x mapreduce -t ColumnMapKeyPrune
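
As a possible alternative to the command-line flag, the same rule can be switched off from inside the script. This is a minimal sketch and I have not verified it against the setup in the question. It assumes your Pig release supports the `pig.optimizer.rules.disabled` property, which may not be the case on older versions. The LOAD statement is the one from the question.

```pig
-- Assumption: this Pig release honors pig.optimizer.rules.disabled.
-- If it does not, fall back to passing -t ColumnMapKeyPrune on the command line.
set pig.optimizer.rules.disabled 'ColumnMapKeyPrune';

-- Same LOAD as in the question: -tagPath together with an explicit schema.
vals = LOAD '/user/guest/test.txt'
    USING PigStorage(';', '-tagPath')
    AS (filepath: chararray, char: chararray);

-- With column pruning off, the loader should no longer be asked for a column subset.
one_column = FOREACH vals GENERATE char;

DUMP one_column;
```

Keeping the setting in the script means the workaround travels with the script, so nobody has to remember the extra flag. Either way the optimization is off for the whole script, so unused columns in other LOAD statements are read as well.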

Cannot use -tagPath and a schema together in PigStorage LOAD
