Flatten in Apache Pig

Time: 2015-09-25 15:14:31

Tags: apache-pig

I have a dataset that looks like this:

DUMP A;
(10000,({(10000),(20000),(50000)},{(10000),(20000),(30000)}))
(20000,({(10000),(20000),(50000)},{(20000)},{(10000),(20000),(30000)}))
(30000,({(30000)},{(10000),(20000),(30000)}))
(40000,({(40000)},{(40000),(50000)}))
(50000,({(40000),(50000)},{(10000),(20000),(50000)}))
DESCRIBE A;
{foo: bytearray, bar_gp: (baz: {(foo: bytearray)})}

Ultimately, I want it to look like this:

DUMP A;
(10000,{(10000),(20000),(50000),(30000)})
(20000,{(10000),(20000),(50000),(30000)})
(30000,{(10000),(20000),(30000)})
(40000,{(40000),(50000)})
(50000,{(40000),(50000),(10000),(20000)})
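
To pin down the transformation being asked for, here is a minimal reference model in Python rather than Pig Latin. It is a sketch of the intended result only, not a Pig solution: the function name `merge_bags` and the list-based representation of bags are assumptions made for illustration. Each row's bags are unioned and de-duplicated. Because Pig bags are unordered, the values are sorted here purely to make the output deterministic.

```python
def merge_bags(rows):
    """For each (key, bags) row, union all bags and drop duplicates.

    `rows` is an iterable of (key, tuple_of_bags), where each bag is
    modeled as a plain list of values (an assumption for illustration).
    """
    merged = {}
    for key, bags in rows:
        values = set()
        for bag in bags:
            values.update(bag)  # union this bag into the running set
        merged[key] = sorted(values)  # sorted only for stable output
    return merged


# The sample data from DUMP A, with bags written as lists.
rows = [
    (10000, ([10000, 20000, 50000], [10000, 20000, 30000])),
    (20000, ([10000, 20000, 50000], [20000], [10000, 20000, 30000])),
    (30000, ([30000], [10000, 20000, 30000])),
    (40000, ([40000], [40000, 50000])),
    (50000, ([40000, 50000], [10000, 20000, 50000])),
]

result = merge_bags(rows)
for key, values in result.items():
    print(key, values)
```

Each key ends up with the same set of values as in the desired output above. Only the order within a bag may differ, which does not matter for an unordered bag.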

I tried the following:

B = FOREACH A GENERATE $0, FLATTEN($1);
C = FOREACH B { D = FOREACH B GENERATE FLATTEN($1); D = DISTINCT D; GENERATE $0, D; }

But I keep getting this error:

expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)

How can I get the desired output? I know I could parse it with a UDF, but I would like to find a built-in solution.

1 Answer:

Answer 0: (score: 0)

I think you need to apply DISTINCT to the bag explicitly before flattening it.

B = FOREACH A {
   -- de-duplicate $1 inside the nested block, then flatten the result
   D = DISTINCT $1;
   GENERATE $0, FLATTEN(D);
};