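Extracting fields from JSON using Hive

Asked: 2017-04-12 10:25:19

Tags: json hadoop hive

I have a problem with my Hive code. I want to extract fields from JSON data using Hive. Here is a sample of the JSON format: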

{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"versionModified":{"machine":"123.dfer","founder":"3.0","state":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}

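I want to get the following fields:

  • ver
  • type
  • vehicle
  • ts
  • founder

The problem is that founder and state sit inside the "Version" array. Can anyone help me work around this? Also, a different key sometimes appears in place of versionModified.

For example, sometimes my data looks like this: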

{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"anotherCriteria":{"engine":"123.dfer","developer":"3.0","state":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}

Adding some more sample data below:

{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"ABC":{"XYZ":"123.dfer","founder":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}

{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"GAP":{"XVY":"123.dfer","FAH":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}

{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"BOX":{"VOG":"123.dfer","FAH":"3.0","FAX":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}

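I need to put this data into different tables depending on the key inside Version: if it is "BOX", the record goes into one table; if it is "GAP", into another, and so on.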

1 Answer:

Answer 0 (score: 1)

You can use a JSON SerDe to get all the fields.

Follow the steps below.

1. Download the JSON SerDe from http://www.congiu.net/hive-json-serde/1.3/

2. Add the JSON SerDe jar

hive> ADD jar /root/json-serde-1.3-jar-with-dependencies.jar;
Added [/root/json-serde-1.3-jar-with-dependencies.jar] to class path
Added resources: [/root/json-serde-1.3-jar-with-dependencies.jar]

3. Create the table

CREATE TABLE json_serde_table (
  Rtype struct<ver:int, os:string,type:string,vehicle:string,MOD: struct<Version:Array<struct<versionModified:struct<machine:string,founder:string,state:string,fashion:string,cdc:string,dof:string,ts:string>>>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

4. Load the JSON file into the table

hive> load data local inpath '/root/json.txt' INTO TABLE json_serde_table;
Loading data to table default.json_serde_table
Table default.json_serde_table stats: [numFiles=1, totalSize=234]
OK
Time taken: 0.877 seconds

5. Run the query below to get the result

hive> select Rtype.ver ver ,Rtype.type type ,Rtype.vehicle vehicle ,Rtype.MOD.version[0].versionModified.ts ts,Rtype.MOD.version[0].versionModified.founder founder,Rtype.MOD.version[0].versionModified.state state from json_serde_table;
Query ID = root_20170412170606_a674d31b-31d7-477b-b9ff-3ebd76636cf8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1491484583384_0018, Tracking URL = http://mac127:8088/proxy/application_1491484583384_0018/
Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job  -kill job_1491484583384_0018
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-04-12 17:06:44,990 Stage-1 map = 0%,  reduce = 0%
2017-04-12 17:06:53,361 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.8 sec
MapReduce Total cumulative CPU time: 1 seconds 800 msec
Ended Job = job_1491484583384_0018
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.8 sec   HDFS Read: 4891 HDFS Write: 50 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 800 msec
OK
1       ns      Mh-3412 2000-04-01T00:00:00.171Z        3.0     Florida
Time taken: 19.745 seconds, Fetched: 1 row(s)
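
Note on the variable key: the table definition above names versionModified explicitly, so it only covers records that use that key. What follows is a minimal sketch of one possible way to handle an arbitrary key (anotherCriteria, BOX, GAP, ...) with Hive's built-in get_json_object and regexp_extract functions instead of the SerDe. It is not part of the original answer. The table and view names (json_raw, json_flat, box_records, gap_records) and the output column names are hypothetical, and the sketch assumes each line is valid JSON with exactly one object inside Version; get_json_object returns NULL for input it cannot parse.

```sql
-- Hypothetical staging table: one raw JSON document per line, kept as text.
CREATE TABLE json_raw (line STRING);

LOAD DATA LOCAL INPATH '/root/json.txt' INTO TABLE json_raw;

-- Split the first element of Version into its (variable) key and the
-- object stored under that key, then read fields from that object.
CREATE VIEW json_flat AS
SELECT
  get_json_object(line, '$.Rtype.ver')     AS ver,
  get_json_object(line, '$.Rtype.type')    AS rec_type,
  get_json_object(line, '$.Rtype.vehicle') AS vehicle,
  version_key,
  get_json_object(body, '$.ts')            AS ts,
  get_json_object(body, '$.founder')       AS founder
FROM (
  SELECT
    line,
    -- name of the first key, e.g. versionModified, BOX, GAP
    regexp_extract(v, '^\\{\\s*"([^"]+)"', 1)              AS version_key,
    -- everything after that key's colon, minus the outer closing brace
    regexp_extract(v, '^\\{\\s*"[^"]+"\\s*:(.*)\\}\\s*$', 1) AS body
  FROM (
    SELECT line,
           get_json_object(line, '$.Rtype.MOD.Version[0]') AS v
    FROM json_raw
  ) a
) b;

-- Hypothetical target tables, one per key.
CREATE TABLE box_records (ver STRING, rec_type STRING, vehicle STRING, ts STRING, founder STRING);
CREATE TABLE gap_records (ver STRING, rec_type STRING, vehicle STRING, ts STRING, founder STRING);

-- Multi-table insert: read the view once and route rows by key.
FROM json_flat
INSERT INTO TABLE box_records
  SELECT ver, rec_type, vehicle, ts, founder WHERE version_key = 'BOX'
INSERT INTO TABLE gap_records
  SELECT ver, rec_type, vehicle, ts, founder WHERE version_key = 'GAP';
```

Records whose inner object has no founder key, such as the BOX and GAP samples, get NULL in that column. Add one INSERT clause per key you expect to see.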