Question

Below is a JSON file exported from LinkedIn using the API.

{
  "numResults": 21,
  "people": {
    "total": 21,
    "values": {
      "firstName": "Kshitiz",
      "headline": "Interbank Derivatives  Bank Treasury",
      "id": "aK8sji3rN7",
      "industry": "Financial Services",
      "lastName": "Jain",
      "locations": {
        "country": {"code": "in"},
        "name": "Mumbai Area, India"
      },
      "numConnections": 500,
      "pictureUrl": "http://m3.licT5WVdExyDEYDzE6cp0VwZ"
    }
  }
}

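I saved the above JSON document in a text file and loaded it into the Hadoop directory /sample.

I created an external table with the following command. The SerDe JAR file was added as well.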

create external table linkedi(numResults int,people Struct<total:int,values:Struct<firstName:String,headline:String,id:String,industry:String,lastName:String,locations:Struct<country:Struct<code:String>,name:String>,numConnections:int,pictureUrl:String>>)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' location '/sample';

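When I run the select statement (select * from linkedi;), the following error is shown.

Failed with exception java.io.IOException:java.lang.ClassCastException: org.json.JSONObject cannot be cast to [Ljava.lang.Object;
Time taken: 0.213 seconds

What is the reason for this error? Is there a mistake in the structure of the table?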

Answer 1

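I ran into the same trouble. Panshul is right: Apache's SerDe does not support nested JSON. However, I still could not get "hive-json-serde-0.2.jar" to work, at least not with the latest version of Hive.

The best approach I found is to use the OpenX SerDe library. In short, the working JAR is json-serde-1.3-jar-with-dependencies.jar, which can be found here. It works with 'STRUCT' and can even ignore some malformed JSON. When creating the table, include the following: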

 ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
 WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
 LOCATION ...

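If needed, you can recompile it from here or here. I tried the first repository, and it compiled for me after I added the necessary libraries. That repository has also been updated recently.

See here for more details.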

Answer 2

The SerDe you are using does not support nested JSON. You can either flatten the JSON first or try using: hive-json-serde.googlecode.com/files/hive-json-serde-0.2.jar
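
For the flattening route, the nesting could be collapsed before the file is uploaded to /sample. What follows is only a minimal sketch in Python, not part of any SerDe. The helper name `flatten` and the underscore-joined column names are assumptions made for illustration, and the record is a shortened copy of the one in the question. The table would then need to be declared with flat columns matching the generated names instead of Struct types.

```python
import json

def flatten(obj, prefix=""):
    """Collapse nested dicts into a single level, joining keys with underscores."""
    flat = {}
    for key, value in obj.items():
        # Hypothetical naming scheme: parent_child, e.g. people_total
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

# Shortened copy of the record from the question
record = {
    "numResults": 21,
    "people": {
        "total": 21,
        "values": {
            "firstName": "Kshitiz",
            "lastName": "Jain",
            "locations": {
                "country": {"code": "in"},
                "name": "Mumbai Area, India",
            },
            "numConnections": 500,
        },
    },
}

flat = flatten(record)
# Serialize to a single line so the file holds one record per line
line = json.dumps(flat)
print(line)
```

Writing each record as a single line is a deliberate choice here: Hive's text input is line-oriented, so a pretty-printed document spread over several lines is unlikely to be read as one row.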

How do I import a LinkedIn JSON file into a Hive external table?