Pig JsonLoader() bug?

Asked: 2015-04-07 14:18:06

Tags: json apache-pig

I am trying to load a JSON file with Pig. The file loads successfully, but I have run into what looks like a bug.

Schema:
id,name,brand,color

Data:

{"id":2561,"name":"abc","brand":"Levis","color":"Blue"}
{"id":2562,"brand":"Adidas","color":"Black"}
{"id":2563,"name":"edf","brand":"Nike","color":"White"}

Code:

raw = LOAD '$INPUT_PATH' USING JsonLoader('
id:chararray,
name:chararray,
brand:chararray,
color:chararray
');

x = foreach raw generate id,brand;
dump x;

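If a particular row does not contain every field defined in the schema, the result is wrong. The brand in the second row should be Adidas, not Black: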

(2561,Levis)
(2562,Black)
(2563,Nike)

Is there a workaround for this?

Thanks in advance.

1 answer:

Answer 0 (score: 2)

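I suggest using elephant-bird instead of the built-in JsonLoader. Elephant-bird stores each input JSON record as key/value pairs (i.e. a map), so it is easy to extract the fields you need even when some fields are missing from the input JSON.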

Download the two jar files (elephant-bird-pig-4.1.jar and elephant-bird-hadoop-compat-4.1.jar) and try the following.

input.json:

{"id":2561,"name":"abc","brand":"Levis","color":"Blue"}
{"id":2562,"brand":"Adidas","color":"Black"}
{"id":2563,"name":"edf","brand":"Nike","color":"White"}

Pig script:

REGISTER /tmp/elephant-bird-pig-4.1.jar;
REGISTER /tmp/elephant-bird-hadoop-compat-4.1.jar;

A = LOAD 'input.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS myMap;
B = FOREACH A GENERATE myMap#'id' AS ID,myMap#'brand' AS brand;
DUMP B;

Output:

(2561,Levis)
(2562,Adidas)
(2563,Nike)
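
If you need all four fields from the question's schema, the same map lookup extends naturally. The following is only a sketch that I have not run against this exact data. It assumes the same jar paths and input file as above. The explicit `AS (myMap:map[])` declaration and the `(chararray)` casts are my additions and are not required by elephant-bird. In Pig, looking up a key that is not present in a map yields null, so the row without a "name" should come back with a null name instead of shifting the other values.

```pig
REGISTER /tmp/elephant-bird-pig-4.1.jar;
REGISTER /tmp/elephant-bird-hadoop-compat-4.1.jar;

-- Load each JSON record as a single map column (declared explicitly here).
A = LOAD 'input.json'
    USING com.twitter.elephantbird.pig.load.JsonLoader()
    AS (myMap:map[]);

-- Look up every field by key rather than by position.
-- A key that is absent from a record (e.g. 'name' for id 2562) yields null.
B = FOREACH A GENERATE
        (chararray) myMap#'id'    AS id,
        (chararray) myMap#'name'  AS name,
        (chararray) myMap#'brand' AS brand,
        (chararray) myMap#'color' AS color;

DUMP B;
```

Because every value is fetched by its key, the order of the fields in the JSON and any missing fields no longer affect which column a value lands in.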