展平仅展开chararray

时间:2015-05-18 19:55:53

标签: hadoop apache-pig

我有以下数据(加载在变量A中):

(a1:a2:a3|a4:a5:a6)
(b1:b2:b3)
(c1:c2:c3|c4:c5:c6|c7:c8:c9)

我希望我的最终输出如下:

(a1:a2:a3)
(a4:a5:a6)
(b1:b2:b3)
(c1:c2:c3)
(c4:c5:c6)
(c7:c8:c9)

这是我做的:

B = foreach B generate flatten(STRSPLIT($0, '\\|')) as splitted:chararray;

这将数据转换为:

(a1:a2:a3,a4:a5:a6)
(b1:b2:b3)
(c1:c2:c3,c4:c5:c6,c7:c8:c9)

具有以下结构:

B: {splitted: chararray}

然而,当我尝试将这个chararray扁平化为单独的元组时,它只会吐出第一个项目。我已经尝试了几种不同的方法来获得我想要的输出,但我总是得到第一项。以下是我尝试过的几件事:

req_output = foreach B generate flatten(STRSPLIT(splitted, ','));

req_output = foreach B generate flatten(TOBAG(*));

在这两种情况下,我都得到以下输出:

(a1:a2:a3)
(b1:b2:b3)
(c1:c2:c3)

我不确定为什么会这样。如何将所有项目作为不同的元组?我对猪没什么经验,所以任何帮助都会受到赞赏。

1 个答案:

答案 0 :(得分:3)

relation B中,您只存储第一个项目(即splitted variable),这就是此问题的原因。你能从关系B中删除变量splitted吗?

B = foreach B generate flatten(STRSPLIT($0, '\\|')) as splitted:chararray;

B = foreach B generate flatten(STRSPLIT($0, '\\|'));

您可以通过多种方式解决此问题。

<强>输入

a1:a2:a3|a4:a5:a6
b1:b2:b3
c1:c2:c3|c4:c5:c6|c7:c8:c9

选项1:使用TOKENIZE

A = LOAD 'input' USING PigStorage() AS(line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line,'\\|'));
DUMP B;

选项2:使用STRSPLIT + TOBAG

A = LOAD 'input' USING PigStorage() AS(line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\|'));
C = FOREACH B GENERATE FLATTEN(TOBAG(*));
DUMP C;

选项3:使用STRSPLITTOBAG(仅限Pig版本0.14)

A = LOAD 'input' USING PigStorage() AS(line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLITTOBAG(line,'\\|'));
DUMP B;

<强>输出:

(a1:a2:a3)
(a4:a5:a6)
(b1:b2:b3)
(c1:c2:c3)
(c4:c5:c6)
(c7:c8:c9)