Hive for complex nested JSON

Asked: 2014-04-22 13:15:55

Tags: json hadoop hive

I have a raw input JSON snippet ('/home/user/testsample.json'):

{"key": "somehashvalue","columns": [["Event:2014-03-26 00\\:29\\:13+0200:json","{\"user\":{\"credType\":\"ADDRESS\",\"credValue\":\"01:AA:A4:G1:HH:UU\",\"cAgent\":null,\"cType\":\"ACE\"},\"timestamp\":1395786553,\"sessionId\":1395785353,\"className\":\"Event\",\"subtype\":\"CURRENTLYACTIVE\",\"vType\":\"TEST\",\"vId\":1235080,\"eType\":\"CURRENTLYACTIVE\",\"eData\":\"1\"}",1395786553381001],["Event:2014-03-26 00\\:29\\:13+0200:","",1395786553381001]]}
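To illustrate the structure, here is a small Python sketch (with a shortened payload of my own, purely for illustration, not the real data): the second element of each columns entry is itself a JSON string that needs a second parse, and the third element is a bare number with no key.

```python
import json

# Shortened stand-in for one record from testsample.json (illustration only).
raw = ('{"key": "somehashvalue","columns": '
       '[["Event:2014-03-26 00\\\\:29\\\\:13+0200:json",'
       '"{\\"timestamp\\":1395786553,\\"eData\\":\\"1\\"}",'
       '1395786553381001]]}')

rec = json.loads(raw)
cell = rec["columns"][0]

inner = json.loads(cell[1])    # the event payload is itself a JSON string
print(inner["timestamp"])      # 1395786553
print(type(cell[2]).__name__)  # int -- a bare value, no key for a SerDe to map
```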

I tried using a JSON SerDe to parse the JSON above into my Hive columns. However, the value 1395786553381001 is not in a format the SerDe can map to a Hive column: it has no key (Hive only understands a JSON value when it appears after a key and a `:`).

So I took the array-type approach instead and created a table:

CREATE TABLE mytesttable (
  key string, 
  columns array < array< string > >
  )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

LOAD DATA LOCAL INPATH '/home/user/testsample.json'
OVERWRITE INTO TABLE mytesttable;

select columns[0][1] from mytesttable; gives:

{"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"}

That looks clean, but I also need columns[*][2], i.e. further transformations on the JSON Hive column.

I wrote a regex Hive query to clean up the raw JSON in '/home/user/testsample.json' (assume it is loaded into the table tablewithinputjson):

SELECT
REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(ij.columna, '["][{]', '{'),'[}]["]', '}'), '\\\\', '') AS columna
FROM tablewithinputjson ij;
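For anyone following along, here is a rough Python sketch of what the three REGEXP_REPLACE calls do, on a simplified record of my own (not the real data): strip the quotes around the embedded JSON object and remove the backslash escapes, yielding parseable JSON.

```python
import json
import re

# Simplified stand-in record (illustration only), with the same quirks:
# an embedded JSON string and a bare trailing number.
raw = '{"a": "x","b": [["t\\\\:1","{\\"k\\":\\"v\\"}",99]]}'

s = re.sub(r'"\{', '{', raw)  # drop the quote before an embedded object
s = re.sub(r'\}"', '}', s)    # drop the quote after an embedded object
s = re.sub(r'\\', '', s)      # strip the backslash escapes

parsed = json.loads(s)
print(parsed["b"][0][1])      # {'k': 'v'} -- now a real nested object
print(parsed["b"][0][2])      # 99 -- still a bare value with no key
```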

The query above returns:

{"key": "somehashvalue","columns": [["Event:2014-03-26 00:29:13+0200:json",{"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"},1395786553381001],["Event:2014-03-26 00:29:13+0200:","",1395786553381001]]}

But here 1395786553381001 still cannot be mapped to a Hive column, because it appears after a `,` and not after a `:`; in other words, the value has no key. (I could prepend `"test":` to 1395786553381001, but I don't want to customize the input data, because a) too much customization is something I'm not comfortable with, b) it doesn't seem like a good solution, and c) it would be an unnecessary waste of my Hadoop cluster's space and time.)

To avoid any further confusion: I have not been able to come up with a Hive table format that completely parses and maps all the fields in the raw JSON snippet. Any suggestions are welcome. Please tell me if this looks too convoluted.

1 Answer:

Answer 0 (score: 2):

Posting an end-to-end solution. Step-by-step procedure to convert JSON into a Hive table:

Step 1) Install Maven if it is not already installed

$ sudo apt-get install maven

Step 2) Install git if it is not already installed, then clone the Hive-JSON-Serde repo

$ sudo git clone https://github.com/rcongiu/Hive-JSON-Serde.git

Step 3) cd into the $HOME/Hive-JSON-Serde folder

Step 4) Build the SerDe package

$ sudo mvn -Pcdh5 clean package

Step 5) The SerDe jar will be at $HOME/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar

Step 6) Add the SerDe jar as a dependency in Hive

hive> ADD JAR $HOME/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar;

Step 7) Create a JSON file at $HOME/books.json (example):

{"value": [{"id": "1","bookname": "A","properties": {"subscription": "1year","unit": "3"}},{"id": "2","bookname":"B","properties":{"subscription": "2years","unit": "5"}}]}

Step 8) Create the tmp1 table in Hive

 hive>CREATE TABLE tmp1 (
      value ARRAY<struct<id:string,bookname:string,properties:struct<subscription:string,unit:string>>>   
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 
    'mapping.value' = 'value'   
) 
STORED AS TEXTFILE;
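As a quick sanity check (outside Hive; my own sketch, not part of the original steps), the sample books.json parses to exactly the shape this schema declares: a top-level array of structs, each with a nested properties struct.

```python
import json

# The sample record from books.json above.
raw = ('{"value": [{"id": "1","bookname": "A","properties": '
       '{"subscription": "1year","unit": "3"}},'
       '{"id": "2","bookname": "B","properties": '
       '{"subscription": "2years","unit": "5"}}]}')

doc = json.loads(raw)
assert isinstance(doc["value"], list)       # ARRAY<...>
first = doc["value"][0]                     # struct<id, bookname, properties>
print(first["bookname"])                    # A
print(first["properties"]["subscription"])  # 1year
```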

Step 9) Load the data from books.json into the tmp1 table

hive> LOAD DATA LOCAL INPATH '$HOME/books.json' INTO TABLE tmp1;

Step 10) Create a tmp2 table to perform the explode operation on tmp1. This intermediate step splits the multi-level JSON structure into multiple rows. Note: if your JSON structure is simple and single-level, skip this step.

hive>create table tmp2 as 
 SELECT *
 FROM tmp1
 LATERAL VIEW explode(value) itemTable AS items;
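Roughly speaking, explode turns each element of the value array into its own row (exposed here as the items column). A hedged Python sketch of this step plus the final per-row flattening into id/name/subscription/unit, using the sample data:

```python
import json

# Sample record from books.json (same data as above).
raw = ('{"value": [{"id": "1","bookname": "A","properties": '
       '{"subscription": "1year","unit": "3"}},'
       '{"id": "2","bookname": "B","properties": '
       '{"subscription": "2years","unit": "5"}}]}')
record = json.loads(raw)

# LATERAL VIEW explode(value): one output row ("items") per array element.
rows = [(item["id"], item["bookname"],
         item["properties"]["subscription"], item["properties"]["unit"])
        for item in record["value"]]

for row in rows:
    print(row)
# ('1', 'A', '1year', '3')
# ('2', 'B', '2years', '5')
```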

Step 11) Create the books table and load the values from the tmp2 table (note: select from the exploded items column, not value[0], so each row gets its own book):

hive>create table books as 
select items.id as id, items.bookname as name, items.properties.subscription as subscription, items.properties.unit as unit from tmp2;

Step 12) Drop the tmp tables

hive>drop table tmp1;
hive>drop table tmp2;

Step 13) Test the Hive table

hive>select * from books;

Output:

id name subscription unit

1 A 1year 3

2 B 2years 5