Loading a JSON file with arrays/structs and a flexible schema into a Hive table

Date: 2018-02-03 01:05:46

Tags: arrays json struct hive load

I need some help loading a JSON file into a Hive table.

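The schema varies from object to object: some attributes appear in every object, others only in some. There are also structs and arrays.

Here are a couple of sample JSON objects from the file: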

{"asin": "0002000202", "title": "Black Berry, Sweet Juice: On Being Black and White in Canada", "price": 13.88, "imUrl": "http://ecx.images-amazon.com/images/I/51PQAYJ9EDL.jpg", "related": {"also_bought": ["0393333094"], "buy_after_viewing": ["0393333094", "1554685087"]}, "salesRank": {"Books": 3013713}, "categories": [["Books"]]}
{"asin": "0000041696", "title": "Arithmetic 2 A Beka Abeka 1994 Student Book (Traditional Arithmentic Series)", "price": 6.53, "imUrl": "http://ecx.images-amazon.com/images/I/41cGaan-BrL._SL500_.jpg", "related": {"also_viewed": ["B000KOYDUY", "B004GE1B7W", "B008SXBO88", "B001EH7Y02", "B000W7PN62", "B004H3G1X6", "B004WOEIXA", "B000AXWEEM", "0789478722", "B000MN2C56", "1402709269", "B001HHOKG0", "B000Y9TO1S", "1402711441", "0756609356", "0142400106", "1556616465", "0545021383", "B004LDD18A", "B000HZH18C", "1557996563", "B00CZTVUKI", "B001CXK8Y2", "B000QX6KY6"], "buy_after_viewing": ["B000KOYDUY", "B004GE1B7W", "B000LBXGRC", "0439827655"]}, "salesRank": {"Books": 2554321}, "categories": [["Books"]]}

This is my create table statement:

create table amazon.products_test
(asin string,
title string,
description string,
brand string,
price float,
salesRank struct<category:string, rank:int> ,
imUrl string,
categories array<string>,
related struct<also_bought:string, also_viewed:string, buy_after_viewing:string, bought_together:string>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

My load statement:

load data inpath '/user/amazon/products_test.json'
overwrite into table amazon.products_test;

Do I have the data types right? Is there a better SerDe? Do I need to add TBLPROPERTIES or SERDEPROPERTIES?

1 answer:

Answer 0: (score: 1)

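I found the answer. As I suspected, I needed to use a different SerDe. I had seen several forum posts suggesting this one, but I did not know how to set it up:

https://github.com/rcongiu/Hive-JSON-Serde

The line to use in the create table statement is:

ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'

  • Also, I needed to use a map type instead of a struct for salesRank.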
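
Putting the two changes from this answer together, the create table statement might look roughly like the sketch below. Only the SerDe class and the map type for salesRank come from the answer itself. The jar path is hypothetical, and the types chosen for categories and for the fields of related are assumptions read off the sample objects in the question, where categories is a list of lists and the related fields are lists of strings. bought_together does not appear in the samples, so its type is a guess.

```sql
-- Hypothetical path: point this at wherever the Hive-JSON-Serde jar actually lives.
ADD JAR /path/to/json-serde-jar-with-dependencies.jar;

CREATE TABLE amazon.products_test (
  asin        string,
  title       string,
  description string,              -- absent from the sample objects; read as NULL
  brand       string,              -- absent from the sample objects; read as NULL
  price       float,
  salesRank   map<string,int>,     -- from the answer: the key is the category name, e.g. "Books"
  imUrl       string,
  categories  array<array<string>>,             -- assumption: the samples show [["Books"]]
  related     struct<also_bought:array<string>,        -- assumption: lists of ASINs in the samples
                     also_viewed:array<string>,
                     buy_after_viewing:array<string>,
                     bought_together:array<string>>    -- assumption: not present in the samples
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
```

With a table declared this way, the nested values could be read with map and array indexing, for example:

```sql
SELECT asin,
       salesRank['Books']        AS books_rank,        -- map lookup by category name
       categories[0][0]          AS first_category,    -- first entry of the first category list
       size(related.also_viewed) AS also_viewed_count  -- number of ASINs in the list
FROM amazon.products_test
LIMIT 10;
```

A map fits salesRank because the key changes from product to product, whereas a struct would need a fixed set of field names known in advance.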