Hive: Challenges loading nested JSON data into Hive tables

Time: 2016-09-26 18:50:59

Tags: json hive nested apache-spark-sql hive-serde

I am trying to load deeply nested JSON data into a Hive table. Here is what I have tried so far.

1- I have JSON files that are deeply nested: arrays of structs whose elements in turn contain struct fields.
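To make "arrays of structs that contain further struct fields" concrete, here is a minimal Python sketch that derives a Hive-style type string from a single parsed JSON value. The function name `hive_type` and the sample record are hypothetical and only for illustration; real schema inference in Spark looks at many records and handles more types than this.

```python
import json

def hive_type(value):
    """Return a Hive-style type string for one parsed JSON value (illustrative only)."""
    if isinstance(value, dict):
        # Fields are sorted by name so the output is stable.
        fields = ",".join(f"{k}:{hive_type(v)}" for k, v in sorted(value.items()))
        return f"struct<{fields}>"
    if isinstance(value, list):
        # Assumption: arrays are homogeneous, so the first element is representative.
        return f"array<{hive_type(value[0]) if value else 'string'}>"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"

# Made-up record shaped like the `claims` column of the table in this post.
record = json.loads('{"claims": {"id": "c1", "claim": [{"id": "a", "num": "1"}]}}')
print(hive_type(record["claims"]))
```

Each extra level of nesting in the JSON adds another `struct<...>` or `array<...>` wrapper, which is how the column types in this post grow to thousands of characters.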

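2- I loaded this JSON data into a Spark DataFrame successfully and was able to view the schema. I also stored that DataFrame as a Hive table with the following command from the spark shell.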

new org.apache.spark.sql.hive.HiveContext(sc).read.json(
"/user/alpha/test.json").saveAsTable("mywarehouse.patent_data_2001");

But when I run any query, for example select * from patent_data_2001 limit 1

it gives me the following error

FAILED: IllegalArgumentException Error: type expected at the position 4339 of 'array<struct<id:string>>:struct<claim:array<struc
.
.
.
.
.
,wi:string>,nb_file:string>>:struct<id:string,sequence_list:struct<carriers:string,file:string,seq_file_type:string>>' but 'stru
c' is found.
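The message says that a type name was expected at a given offset of the combined type string, but the token found there was 'struc'. As a rough way to reproduce that kind of check outside Hive, here is a simplified Python sketch of a type-string validator. It is not Hive's own parser: the names `check_type`, `expect`, `TypeSyntaxError` and `PRIMITIVES` are hypothetical, only a few primitive types plus `array` and `struct` are covered, whitespace is not handled, and the sample strings are made up.

```python
PRIMITIVES = {"string", "int", "bigint", "double", "boolean"}

class TypeSyntaxError(ValueError):
    """Raised when the type string is malformed."""

def expect(s, pos, ch):
    """Require character ch at s[pos]; return the position after it."""
    if pos >= len(s) or s[pos] != ch:
        found = s[pos] if pos < len(s) else "end of string"
        raise TypeSyntaxError(f"{ch!r} expected at position {pos} but {found!r} is found")
    return pos + 1

def check_type(s, pos=0):
    """Parse one type starting at s[pos]; return the position just after it."""
    end = pos
    while end < len(s) and (s[end].isalnum() or s[end] == "_"):
        end += 1
    name = s[pos:end]
    if name in PRIMITIVES:
        return end
    if name == "array":
        end = expect(s, end, "<")
        end = check_type(s, end)
        return expect(s, end, ">")
    if name == "struct":
        end = expect(s, end, "<")
        while True:
            # Skip the field name (possibly back-quoted) up to the colon.
            colon = s.find(":", end)
            if colon == -1:
                raise TypeSyntaxError(f"':' expected after position {end}")
            end = check_type(s, colon + 1)
            if end < len(s) and s[end] == ",":
                end += 1
                continue
            return expect(s, end, ">")
    raise TypeSyntaxError(f"type expected at position {pos} but {name!r} is found")

good = "array<struct<id:string>>"
bad = "struct<a:string,b:struc"  # made-up string that stops in the middle of a token

print(check_type(good) == len(good))
try:
    check_type(bad)
except TypeSyntaxError as err:
    print(err)
```

Running a column's type string through a check like this can show whether the string itself is malformed at the reported offset, or whether the string is fine and the problem is introduced somewhere between the writer and the reader.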

3- I tried using the Hive SerDe instead of the jar supplied with Spark SQL, using the command below from the spark shell

 hc.sql("SET spark.sql.hive.convertMetastoreParquet=false")

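Still the same error.

It creates the table in the Hive warehouse and loads the data, but when I try to query the table, or even describe it, it gives me the error.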

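4- Assuming the problem might be a compatibility issue between Spark SQL and Hive, I thought of fixing the schema, since only the table schema is wrong once the table has been created. I fixed the schema by hand, which was a long and really time-consuming process. I then recreated the Hive table manually with the CREATE TABLE statement below.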

CREATE TABLE `patent_data_2001`(
  `abstract` array<struct<id:string>> COMMENT '', 
  `claims` struct<claim:array<struct<id:string,num:string>>,id:string> COMMENT '', 
  `country` string COMMENT '', 
  `date_produced` string COMMENT '', 
  `date_publ` string COMMENT '', 
  `description` string COMMENT '', 
  `drawings`      struct<figure:array<struct<id:string,img:struct<alt:string,file:string,he:string,id:string,img_content:string,img_format:string,orientation:string,wi:string>,num:string>>,id:string> COMMENT '', 
  `dtd_version` string COMMENT '', 
  `file` string COMMENT '', 
  `id` string COMMENT '', 
  `lang` string COMMENT '', 
  `status` string COMMENT '', 
  `table_external_doc` array<string> COMMENT '', 
  `us_bibliographic_data_grant` struct<application_reference:struct<appl_type:string,document_id:struct<country:string,`date`:string,doc_number:string>>,assignees:struct<assignee:array<struct<addressbook:struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string,orgname:string,role:string>,first_name:string,last_name:string,orgname:string,role:string>>>,classification_locarno:struct<edition:string,main_classification:string>,classification_national:array<struct<country:string,main_classification:string>>,classifications_cpc:struct<further_cpc:struct<classification_cpc:array<struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>>,combination_set:array<struct<combination_rank:array<struct<classification_cpc:struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>,rank_number:string>>,group_number:string>>>,main_cpc:struct<classification_cpc:struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>>>,classifications_ipcr:struct<classification_ipcr:array<struct<action_date:struct<`date`:string>,classification_data_source:string,classification_level:string,classification_status:string,classification_value:st
ring,generating_office:struct<country:string>,ipc_version_indicator:struct<`date`:string>,main_group:string,section:string,subclass:string,subgroup:string,symbol_position:string>>>,examiners:struct<assistant_examiner:struct<first_name:string,last_name:string>,primary_examiner:struct<department:string,first_name:string,last_name:string>>,invention_title:string,number_of_claims:string,pct_or_regional_filing_data:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>,us_371c124_date:struct<`date`:string>,us_371c12_date:struct<`date`:string>>,pct_or_regional_publishing_data:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>>,priority_claims:struct<priority_claim:array<struct<country:string,`date`:string,doc_number:string,kind:string,sequence:string>>>,publication_reference:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>>,rule_47_flag:string,us_application_series_code:string,us_botanic:struct<latin_name:string,variety:string>,us_field_of_classification_search:struct<classification_national:array<struct<country:string,main_classification:string>>>,us_parties:struct<agents:struct<agent:array<struct<addressbook:array<struct<address:struct<country:string>,first_name:string,last_name:string,orgname:string>>,rep_type:string,sequence:string>>>,inventors:struct<inventor:array<struct<addressbook:array<struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string>>,designation:string,sequence:string>>>,us_applicants:struct<us_applicant:array<struct<addressbook:array<struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string,orgname:string>>,app_type:string,applicant_authority_category:string,designation:string,sequence:string>>>>,us_references_cited:struct<us_citation:array<struct<classification_national:array<struct<country:string,main_classification:string>>>>>,us_related_documents: struct < 
continuation: array< struct< relation: struct< child_doc: struct< document_id: struct < country: string,   `date`: string, doc_number: string > >, parent_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string>, parent_grant_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_pct_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_status: string>>>>, continuation_in_part: array< struct < relation: struct< child_doc: struct< document_id: struct< country: string,`date`:string,doc_number: string>>, parent_doc: struct< document_id: struct< country: string,`date`:string,doc_number: string>, parent_grant_document: struct< document_id: struct< country: string,`date`:string,doc_number: string >>, parent_pct_document: struct< document_id: struct < country: string,`date`:string, doc_number: string>>, parent_status: string >>>>, division: array< struct< relation: struct< child_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string >, parent_grant_document: struct< document_id: struct< country: string,`date`:string, doc_number: string >>, parent_pct_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_status: string >>>>, reissue: array< struct< relation: struct < child_doc: struct < document_id: struct < country: string,`date`:string, doc_number: string>>, parent_doc: struct< document_id: struct < country: string,`date`:string,doc_number: string > ,parent_grant_document: struct < document_id: struct < country: string,`date`:string, doc_number: string >>, parent_pct_document: struct< document_id: struct < country: string,`date`:string, doc_number: string >>, parent_status: string >>>>, related_publication: array< struct < document_id: array < struct< country: string,`date`:string,doc_number: 
string, kind: string>>>>, substitution: array< struct< relation: struct < child_doc: struct < document_id: struct < country: string,`date`:string, doc_number: string >>, parent_doc: struct< document_id: struct < country: string,`date`:string, doc_number: string >, parent_status: string >>>>, us_provisional_application: array< struct < document_id: struct < country: string,`date`:string, doc_number: string >>>, us_term_of_grant: struct < disclaimer: array < struct< text: string>>>>>, 
  `us_chemistry` array<struct<cdx_file:string,idref:string,mol_file:string>> COMMENT '', 
  `us_claim_statement` string COMMENT '', 
  `us_math` array<struct<idrefs:string,img:struct<alt:string,file:string,he:string,id:string,img_content:string,img_format:string,wi:string>,nb_file:string>> COMMENT '', 
  `us_sequence_list_doc` struct<id:string,sequence_list:struct<carriers:string,file:string,seq_file_type:string>> COMMENT '')
  ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'path'='hdfs://cluster-A-XYZ:8020/user/hive/warehouse/mywarehouse/patent_data_2001') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://cluster-A-XYZ:8020/user/hive/warehouse/mywarehouse/patent_data_2001';

But I got the same error again.

5- I tried the SerDes listed at the links below

https://github.com/rcongiu/Hive-JSON-Serde
https://github.com/proofpoint/hive-serde
http://www.congiu.net/hive-json-serde/1.3.6/

But no luck.

6- While researching, I found that this might be the issue described in these JIRAs

**ArrayIndexOutOfBounds exception for deeply nested structs**
https://issues.apache.org/jira/browse/HIVE-3253

**Support nested structs over 24 levels.**
https://issues.apache.org/jira/browse/HIVE-9500
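The second JIRA title mentions a limit of 24 nesting levels. As a quick way to see how deep a given column type goes, here is a small Python sketch that counts angle-bracket nesting in a type string. The function name `max_nesting_depth` is hypothetical, the sample is a shortened version of the `drawings` column from the CREATE TABLE statement above, and treating each `<` as one "level" is an assumption about how those JIRAs count levels.

```python
def max_nesting_depth(type_string):
    """Return the deepest level of <...> nesting in a Hive-style type string."""
    depth = 0
    deepest = 0
    for ch in type_string:
        if ch == "<":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ">":
            depth -= 1
    return deepest

# Shortened form of the `drawings` column type from this post.
drawings = ("struct<figure:array<struct<id:string,"
            "img:struct<alt:string,file:string>,num:string>>,id:string>")

print(max_nesting_depth(drawings))  # deepest bracket level
print(len(drawings))                # total length of the type string
```

Printing the depth and the length for every column makes it easier to judge whether the table actually comes close to the nesting limit described in those issues.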

I ran the table creation with the SerDe below
ROW FORMAT SERDE   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'hive.serialization.extend.nesting.levels'='true' )

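But I got the same error, with no further details.

I cannot figure out why these errors occur, or why Hive does not let me query the table when it was created successfully.

Any help or suggestions would be very useful. Please help.

Many thanks.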

0 answers:

No answers