Problem with complex data types in Pig

Date: 2015-02-13 02:21:13

Tags: apache-pig

I have an input.txt that looks like this:

{"charId":1111,"encounters":[{"alias":"A","guid":192,"data1":0,"data2":0,"temporary":1},{"alias":"B","guid":952,"data1":0,"data2":0,"temporary":1}]}
{"charId":2222,"encounters":[{"alias":"C","guid":544,"data1":0,"data2":0,"temporary":1}]}
{"charId":3333,"encounters":[]}

My question is how to get output that looks like this:

(1111, A, 192, 0, 0, 1)
(1111, B, 952, 0, 0, 1)
(2222, C, 544, 0, 0, 1)
(3333,  ,    ,  ,  ,  )

P.S. Here is my script, but it only outputs the first three rows.

raw_data = LOAD 'input.txt' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);

a = FOREACH raw_data GENERATE json#'charId' AS (charId:chararray), FLATTEN(json#'encounters') AS (encounters:map[]);

b = FOREACH a GENERATE charId, encounters#'alias' AS alias, encounters#'guid' AS guid, encounters#'data1' AS data1, encounters#'data2' AS data2, encounters#'temporary' AS temporary;

Thank you very much for your help. I really appreciate it.

1 answer:

Answer 0: (score: 0)

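The reason is that the FLATTEN operator produces no output rows for an empty bag (here, the empty encounters array), so the record with charId 3333 never reaches the final output. One option is to work around this with the approach below. I wouldn't say it is the best solution, but it will at least solve your problem.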

PigScript:

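raw_data = LOAD 'input.txt' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
-- a keeps encounters unflattened so that empty ones can still be detected
a = FOREACH raw_data GENERATE json#'charId' AS (charId:chararray), json#'encounters' AS (encounters:map[]);
-- b flattens encounters; records with empty encounters produce no rows here
b = FOREACH raw_data GENERATE json#'charId' AS (charId:chararray), FLATTEN(json#'encounters') AS (encounters:map[]);
-- c: only the records whose encounters are empty
c = FILTER a BY IsEmpty(encounters);
-- d: pad those records with nulls so they line up with the columns of e
d = FOREACH c GENERATE charId, null AS alias, null AS guid, null AS data1, null AS data2, null AS temporary;
e = FOREACH b GENERATE charId, encounters#'alias' AS alias, encounters#'guid' AS guid, encounters#'data1' AS data1, encounters#'data2' AS data2, encounters#'temporary' AS temporary;
-- f: combine the flattened rows with the null-padded rows
f = UNION e, d;
DUMP f;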

Output:

(1111,A,192,0,0,1)
(1111,B,952,0,0,1)
(2222,C,544,0,0,1)
(3333,,,,,)
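
For anyone who wants to see the logic without running Pig, here is a rough Python model of what the script above does. It is only a sketch, not Pig itself: the function names flatten_rows and empty_rows are made up for illustration, and charId stays an integer here rather than being cast to chararray. It shows that flattening yields one row per encounter (so an empty list yields nothing), and that the null-padded branch adds the missing record back.

```python
import json

# The three sample records from input.txt in the question.
INPUT_LINES = [
    '{"charId":1111,"encounters":[{"alias":"A","guid":192,"data1":0,"data2":0,"temporary":1},'
    '{"alias":"B","guid":952,"data1":0,"data2":0,"temporary":1}]}',
    '{"charId":2222,"encounters":[{"alias":"C","guid":544,"data1":0,"data2":0,"temporary":1}]}',
    '{"charId":3333,"encounters":[]}',
]
FIELDS = ("alias", "guid", "data1", "data2", "temporary")


def flatten_rows(records):
    """Model of relations b/e: one row per encounter, so an empty list yields no rows."""
    rows = []
    for rec in records:
        for enc in rec["encounters"]:
            rows.append((rec["charId"],) + tuple(enc[f] for f in FIELDS))
    return rows


def empty_rows(records):
    """Model of relations c/d: records with no encounters, padded with None."""
    return [
        (rec["charId"],) + (None,) * len(FIELDS)
        for rec in records
        if not rec["encounters"]
    ]


records = [json.loads(line) for line in INPUT_LINES]
flattened = flatten_rows(records)          # charId 3333 is missing here
result = flattened + empty_rows(records)   # model of the UNION, which restores it

for row in result:
    print(row)
```

The split into two branches mirrors the script: a plain flatten cannot emit a row for a record that has nothing to iterate over, so that record has to be produced separately and merged back in.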