Question

I have a use case where I receive an input JSON file. The file contains a JSON array:

[{json1},{json2},{json3},{json4}, .... 100 json responses]

Each of json1, json2, json3, json4, ... has a structure like this:

{"AuthorisedSenderId": "1", "cid": "1", "id": "1"}

I created a table:

CREATE EXTERNAL TABLE db1.sample_table(
authorisedsenderid string, 
cid string, 
id string)
ROW FORMAT SERDE 
  'org.apache.hive.hcatalog.data.JsonSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs:XXXX'

If the file contains only json1 (no array), I can load the input file successfully.

LOAD DATA INPATH 'filelocation' OVERWRITE INTO TABLE db1.sample_table

But if the input file contains a JSON array, the load fails.

Could you help me define a CREATE TABLE command that can ingest the JSON array?

Answer 1

You need to make a small change to the file before the JSON SerDe can process it.

Current content:

[{"AuthorisedSenderId": "1", "cid":"1", "id":"1" },{"AuthorisedSenderId": "2", "cid":"2", "id":"2" }]

Modified content:

{"test":[{"AuthorisedSenderId": "1", "cid":"1", "id":"1" },{"AuthorisedSenderId": "2", "cid":"2", "id":"2" }]}

I added {"test": at the beginning and } at the end.
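
If the file is too large to edit by hand, the wrapping step could be scripted. The following is a minimal sketch using only Python's standard library; the function name wrap_json_array and the default key "test" are illustrative choices, not part of Hive or the SerDe. It writes the result on a single line, on the assumption that the table reads the file as text, one record per line.

```python
import json


def wrap_json_array(array_text, key="test"):
    """Wrap a top-level JSON array in an object: [...] -> {"test": [...]}.

    Hypothetical helper for illustration. The result is serialized on a
    single line so that a line-oriented text reader sees one record.
    """
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps without indent never emits raw newlines.
    return json.dumps({key: records})


if __name__ == "__main__":
    sample = '[{"AuthorisedSenderId": "1", "cid":"1", "id":"1" }]'
    print(wrap_json_array(sample))
```

The wrapped text can then be written to the file that the table's location points to.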

You can then create the table as shown below.

Hive table

CREATE TABLE x (
  test array<struct<authorisedsenderid:string, cid:string, id:string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
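
A different preprocessing option, not part of this answer, is to rewrite the array as one JSON object per line. Since the question reports that a file holding a single object loads fine into the original table, a file with one object per line would be expected to load the same way, one row per line. A minimal sketch with the standard library follows; array_to_json_lines is a hypothetical helper name.

```python
import json


def array_to_json_lines(array_text):
    """Convert a top-level JSON array into one JSON object per line.

    Hypothetical helper for illustration; assumes every array element
    is a JSON object such as {"AuthorisedSenderId": "1", ...}.
    """
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # One compact object per line, with a trailing newline at the end.
    return "".join(json.dumps(record) + "\n" for record in records)


if __name__ == "__main__":
    sample = '[{"AuthorisedSenderId": "1", "cid":"1", "id":"1" },{"AuthorisedSenderId": "2", "cid":"2", "id":"2" }]'
    print(array_to_json_lines(sample), end="")
```

With this layout the columns stay flat strings, so no array<struct<...>> column is needed.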

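However, if you would rather not modify the file and you can use Spark, it becomes much easier, because you do not need to change anything in the JSON file.

Code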

df = spark.read.json("/tmp/sample_table/table/sample.json")
df.write.saveAsTable("db1.sample_table")

Data:

[{"AuthorisedSenderId": "1", "cid":"1", "id":"1" },{"AuthorisedSenderId": "2", "cid":"2", "id":"2" }]
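
The sample data above sits on a single line. If the real file is pretty-printed across several lines, Spark's default line-by-line JSON reader will not parse it, and the multiLine read option is the usual remedy. The sketch below wraps the two calls from the answer in a small function; load_json_array and its parameters are illustrative names, and spark is assumed to be an existing SparkSession.

```python
def load_json_array(spark, path, table, multiline=False):
    """Read a JSON file with Spark and save it as a table.

    Hypothetical helper for illustration. `spark` is an existing
    SparkSession; set multiline=True when the JSON spans several lines.
    """
    reader = spark.read
    if multiline:
        # Parse each file as a whole instead of line by line.
        reader = reader.option("multiLine", True)
    df = reader.json(path)
    # Same write call as in the answer; by default it fails if the table exists.
    df.write.saveAsTable(table)
    return df
```

A call might look like load_json_array(spark, "/tmp/sample_table/table/sample.json", "db1.sample_table", multiline=True).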

