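Insert JSON data from one table into another table in Hive

Time: 2017-04-13 04:30:48

Tags: json hadoop hive

I want to insert JSON data from one table into other tables, depending on a key field in the data.

My data looks like this: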

  

{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"ABC":{"XYZ":"123.dfer","founder":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}

{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"GAP":{"XVY":"123.dfer","FAH":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}

{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"BOX":{"VOG":"123.dfer","FAH":"3.0","FAX":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}

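Here, depending on whether the Version is "BOX", "GAP" or "ABC", I want to populate the fields of that particular JSON row into another table.

For example: if the Version is "GAP", populate that row into one table; if it is "BOX", populate it into a different table, and so on. In other words, all BOX rows go to the same table.

How can I achieve this with Hive? Please help.

Note: my JSON data is stored in a table as a column of type string.

2 answers:

Answer 0 (score: 2)

Demo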

create table src (myjson string);

insert into src values
    ('{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"ABC":{"XYZ":"123.dfer","founder":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}')
   ,('{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"GAP":{"XVY":"123.dfer","FAH":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}')
   ,('{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"BOX":{"VOG":"123.dfer","FAH":"3.0","FAX":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}')
;

create table trg_abc (myjson string);
create table trg_gap (myjson string);
create table trg_box (myjson string);

-- Multi-table insert: src is scanned once and each row is routed
-- to a target table according to the key found under Version[0].
from src
insert into trg_abc select myjson where get_json_object(myjson,'$.Rtype.MOD.Version[0].ABC') is not null
insert into trg_gap select myjson where get_json_object(myjson,'$.Rtype.MOD.Version[0].GAP') is not null
insert into trg_box select myjson where get_json_object(myjson,'$.Rtype.MOD.Version[0].BOX') is not null
;
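
If the goal is to populate individual fields rather than copy the whole JSON string, the same filter can be combined with field extraction. The following is only a minimal sketch: the target table trg_gap_fields and its column names and types are hypothetical and not part of the original answer. Only src, myjson and the JSON paths come from the demo above.

```sql
-- Hypothetical target table; column names and types are assumptions.
create table trg_gap_fields (xvy string, fah string, ght string, ts string);

-- Keep only the rows whose first Version element has a GAP key,
-- and pull selected GAP fields out of the JSON string.
insert into trg_gap_fields
select
    get_json_object(myjson, '$.Rtype.MOD.Version[0].GAP.XVY'),
    get_json_object(myjson, '$.Rtype.MOD.Version[0].GAP.FAH'),
    get_json_object(myjson, '$.Rtype.MOD.Version[0].GAP.GHT'),
    get_json_object(myjson, '$.Rtype.MOD.Version[0].GAP.ts')
from src
where get_json_object(myjson, '$.Rtype.MOD.Version[0].GAP') is not null;
```

get_json_object returns NULL for a path that does not exist, which is why the same call can serve as the filter. The ABC and BOX rows would need their own target tables, since their field names differ.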

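Answer 1 (score: -1)

First, your data needs to be stored as JSON behind a Hive table:

I assume your Hive table is external (it usually is; check with SHOW CREATE TABLE your_table). If so, the whole dataset lives at some HDFS/S3 path, for example s3a://your_bucket/your_jsons_location/

Download json-udf-1.3.7-jar-with-dependencies.jar and run ADD JARS s3a://your_bucket/lib/json-udf-1.3.7-jar-with-dependencies.jar;
Then you have to create a dedicated JSON table for each JSON schema: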

CREATE EXTERNAL TABLE boxes
(Rtype struct<ver:string,os:string,type:string,vehicle:string,MOD:struct<Version:array<struct<BOX:struct<VOG:string,FAH:string,FAX:string,fashion:string,cdc:string,dof:string,ts:string>>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' location 's3a://your_bucket/your_jsons_location/';

CREATE EXTERNAL TABLE gaps
(Rtype struct<ver:string,os:string,type:string,vehicle:string,MOD:struct<Version:array<struct<GAP:struct<XVY:string,FAH:string,GHT:string,fashion:string,cdc:string,dof:string,ts:string>>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' location 's3a://your_bucket/your_jsons_location/';

CREATE EXTERNAL TABLE abcs
(Rtype struct<ver:string,os:string,type:string,vehicle:string,MOD:struct<Version:array<struct<ABC:struct<XYZ:string,founder:string,GHT:string,fashion:string,cdc:string,dof:string,ts:string>>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' location 's3a://your_bucket/your_jsons_location/';

Now, if you run:

SELECT * FROM boxes;
SELECT * FROM gaps;
SELECT * FROM abcs;

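You will see that each table correctly parses only the matching JSON records (according to the schema specified in its CREATE statement). Records that do not match a table's schema show up as NULL in that table.

To filter out the irrelevant records: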
SELECT * FROM abcs WHERE Rtype.mod.version[0].abc IS NOT NULL;
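
To actually populate another table from one of these views of the data, the filter can be used inside an INSERT ... SELECT. This is only a sketch under assumptions: gap_records is a hypothetical table name, its columns are my own choice, and it relies on the gaps table defined above.

```sql
-- Hypothetical table that receives only the GAP records.
CREATE TABLE gap_records (xvy string, fah string, ght string, ts string);

-- Rows whose JSON has no GAP key come back as NULL and are skipped.
INSERT INTO TABLE gap_records
SELECT Rtype.mod.version[0].gap.xvy,
       Rtype.mod.version[0].gap.fah,
       Rtype.mod.version[0].gap.ght,
       Rtype.mod.version[0].gap.ts
FROM gaps
WHERE Rtype.mod.version[0].gap IS NOT NULL;
```

The same pattern would apply to boxes and abcs, each with its own target table and field list.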

Note: this whole explanation assumes that your JSON records are stored externally to the Hive table (I used S3 here, but it could equally be HDFS).