Loading complex JSON into Hive using JsonSerDe

Time: 2015-11-25 21:57:43

Tags: json hadoop hive hiveql

I am trying to build a table in Hive for the following JSON:

{
    "business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
    "hours": {
        "Tuesday": {
            "close": "17:00",
            "open": "08:00"
        },
        "Friday": {
            "close": "17:00",
            "open": "08:00"
        }
    },
    "open": true,
    "categories": [
        "Doctors",
        "Health & Medical"
    ],
    "review_count": 9,
    "name": "Eric Goldberg, MD",
    "neighborhoods": [],
    "attributes": {
        "By Appointment Only": true,
        "Accepts Credit Cards": true, 
        "Good For Groups": 1
    },
    "type": "business"
}

I can create a table with the following DDL, but querying that table throws an exception.

CREATE TABLE IF NOT EXISTS business (
 business_id string,
 hours map<string,string>,
 open boolean,
 categories array<string>,
 review_count int,
 name string,
 neighborhoods array<string>,
 attributes map<string,string>,
 type string
 )
 ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';

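The exception when retrieving data is "ClassCast: cannot cast jsonarray to json object". What is the correct schema for this JSON? Is there any tool that can help me generate the correct schema for a given JSON, for use with JsonSerDe?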

1 answer:

Answer 0 (score: 4)

I think the problem is hours, which you defined as hours map<string,string>, but which should be map<string,map<string,string>>.
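As a rough sketch of that suggestion, here is the DDL from the question with only the hours column changed to a nested map. This has not been run against the org.apache.hadoop.hive.contrib.serde2.JsonSerde class, so treat it as an illustration of the type change rather than a verified fix:

```sql
-- Same table as in the question; only the type of hours differs.
-- hours maps a day name to an inner map holding "open" and "close".
CREATE TABLE IF NOT EXISTS business (
  business_id string,
  hours map<string,map<string,string>>,
  open boolean,
  categories array<string>,
  review_count int,
  name string,
  neighborhoods array<string>,
  attributes map<string,string>,
  type string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';
```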

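There is a tool that generates Hive table definitions automatically from JSON data: https://github.com/quux00/hive-json-schema

You will probably want to adjust its output, though, because when the tool encounters a JSON object (anything between {}), it has no way of knowing whether to translate it into a Hive map or a struct. On your data, the tool gives me this: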

CREATE TABLE x (
 attributes struct<accepts credit cards:boolean, 
       by appointment only:boolean, good for groups:int>,
 business_id string,
 categories array<string>,
 hours map<string,struct<close:string, open:string>>,
 name string,
 neighborhoods array<string>,
 open boolean,
 review_count int,
 type string
)

But it looks like you want something like this:

CREATE TABLE x (
     attributes map<string,string>,
     business_id string,
     categories array<string>,
     hours map<string,struct<close:string, open:string>>,
     name string,
     neighborhoods array<string>,
     open boolean,
     review_count int,
     type string
    ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;

hive> load data local inpath 'json.data'  overwrite into  table x;
hive> Table default.x stats: [numFiles=1, numRows=0, totalSize=416,rawDataSize=0]
OK
hive> select * from x;
OK
{"accepts credit cards":"true","by appointment only":"true",
  "good for groups":"1"}    
  vcNAWiLM4dR7D2nwwJ7nCA    
  ["Doctors","Health & Medical"]    
  {"tuesday":{"close":"17:00","open":"08:00"},
   "friday":{"close":"17:00","open":"08:00"}}   
    Eric Goldberg, MD   ["HELLO"]   true    9   business
Time taken: 0.335 seconds, Fetched: 1 row(s)
hive>
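Once the table loads, the nested columns can be reached with map-key and struct-field syntax. The queries below are a minimal sketch against table x as created above. They assume the map keys are stored in lowercase, as the select output shows ("tuesday", "accepts credit cards"); the column aliases are made up for illustration.

```sql
-- Tuesday opening hours and one attribute for each business.
-- hours is map<string,struct<close,open>>, attributes is map<string,string>.
SELECT name,
       hours['tuesday'].open  AS tuesday_open,
       hours['tuesday'].close AS tuesday_close,
       attributes['accepts credit cards'] AS accepts_cards
FROM x;

-- One output row per entry in the categories array.
SELECT business_id, category
FROM x
LATERAL VIEW explode(categories) c AS category;
```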

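A few notes, though:

  • Note that I used a different JSON SerDe (org.openx.data.jsonserde.JsonSerDe, as in the DDL above), because the one you used is not on my system. I prefer this one, since I wrote it. The create statement should work just as well with the other SerDe.
  • You may want to convert some of those maps to structs, since structs can be more convenient to query. For example, attributes could be a struct, but you would need to map the names that contain spaces, such as accepts credit cards. My SerDe allows mapping a JSON attribute to a different Hive column name. That is also needed when the JSON uses attributes that are Hive keywords, like 'timestamp' or 'create'.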
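The keyword case from the last note can be sketched with the SerDe's mapping property, which takes the form "mapping.<hive column>" = "<json attribute>". The table name events and the columns id and ts are hypothetical, chosen only to show a JSON attribute called timestamp being read into a Hive column with a different name:

```sql
-- Hypothetical table: the JSON records carry a "timestamp" attribute,
-- which is exposed in Hive as the column ts.
CREATE TABLE events (
  id string,
  ts string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("mapping.ts" = "timestamp")
STORED AS TEXTFILE;
```

Whether the same property can rename struct fields whose JSON names contain spaces, such as accepts credit cards, is not shown here and would need checking against the SerDe's documentation.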