I have this XML file:
<wd:File xmlns:wd="urn:com.workday/bsvc">
<wd:Report_Entry>
<wd:Job_Application wd:Descriptor="Agent Smith - R-00620 Pharmacist Manager - Sam's -R2_RHO_H&W_008 (C-00549)">
<wd:ID wd:type="WID">35fe6a9de198812f1f1af872e801de13</wd:ID>
<wd:ID wd:type="Job_Application_ID">JOB_APPLICATION-6-882</wd:ID>
</wd:Job_Application>
<wd:Interviews_group>
<wd:Interview wd:Descriptor="Interview: Agent Smith - R-00620 Pharmacist Manager - Sam's -R2_RHO_H&W_008 (C-00549)">
<wd:ID wd:type="WID">b6d03b1bf47c01e065c15380b437ef1f</wd:ID>
</wd:Interview>
<wd:Interview_Date>2019-03-15-07:00</wd:Interview_Date>
<wd:Interviewer wd:Descriptor="Chad Burke">
<wd:ID wd:type="WID">e83ebdbd2a0a01f021aff165a5e94cca</wd:ID>
<wd:ID wd:type="Employee_ID">105150473</wd:ID>
</wd:Interviewer>
<wd:Interview_Overall_Rating wd:Descriptor="3 - Highly Recommend">
<wd:ID wd:type="WID">dc2b915581810140cdfb4b12c924341f</wd:ID>
</wd:Interview_Overall_Rating>
<wd:Competency_Interview_Feedback>0</wd:Competency_Interview_Feedback>
</wd:Interviews_group>
<wd:Interviews_group>
<wd:Interview wd:Descriptor="Interview: Agent Smith - R-00620 Pharmacist Manager - Sam's -R2_RHO_H&W_008 (C-00549)">
<wd:ID wd:type="WID">b6d03b1bf47c01e065c15380b437ef1f</wd:ID>
</wd:Interview>
<wd:Interview_Date>2019-03-15-07:00</wd:Interview_Date>
<wd:Interviewer wd:Descriptor="ADAM COOPER">
<wd:ID wd:type="WID">2b4ab86a080101f93043d3d28913c123</wd:ID>
<wd:ID wd:type="Employee_ID">225027082</wd:ID>
</wd:Interviewer>
<wd:Interview_Overall_Rating wd:Descriptor="3 - Highly Recommend">
<wd:ID wd:type="WID">dc2b915581810140cdfb4b12c924341f</wd:ID>
</wd:Interview_Overall_Rating>
<wd:Competency_Interview_Feedback>0</wd:Competency_Interview_Feedback>
</wd:Interviews_group>
</wd:Report_Entry>
</wd:File>
Here Interviews_group is an array of structs, each containing several tags. I have to load this file into Hive, so I am using the Hive XmlSerDe. My Hive query is as follows:
Hive Query:
CREATE EXTERNAL TABLE IF NOT EXISTS staging.applications_test
(Job_Application string,
Job_Application_ID string,
Interviews_group array<struct<Interview:string,Interview_date:string,Interviewer:String,Interviewr_ID:String,Interview_Overall_Rating:string,Competency_Interview_Feedback:string>>
)
ROW FORMAT SERDE 'com.walmart.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES("column.xpath.Job_Application"="//*[local-name()='Job_Application']/@*[local-name()='Descriptor']",
"column.xpath.Job_Application_ID"="//*[local-name()='Job_Application']/*[local-name()='ID'][contains(@*[local-name()='type'],'Job_Application_ID')]/text()",
"column.xpath.Interviews_group"="//*[local-name()='Interview']/@*[local-name()='Descriptor']"
)
STORED AS
INPUTFORMAT 'com.walmart.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/hive/hr/highsecure/raw/hiring/wd/job_applications/'
TBLPROPERTIES (
"xmlinput.start"="<wd:Report_Entry>",
"xmlinput.end"="</wd:Report_Entry>",
"xmlinput.nsstart"="<wd:File xmlns:wd='urn:com.workday/bsvc'>",
"xmlinput.nsend"="</wd:File>"
);
Unfortunately, the result for Interviews_group contains nulls instead of the values from the XML.
Output:
Interviews_group
[{"interview":null,
"interview_date":null,
"interviewer":null,
"interviewr_id":null,
"interview_overall_rating":null,
"competency_interview_feedback":null}]
I get the correct values for columns 1 and 2 (Job_Application and Job_Application_ID), but not for column 3, Interviews_group.
How can I achieve this? Expected output:
[{"interview":xml_value,
"interview_date":xml_value,
"interviewer":xml_value,
"interviewr_id":xml_value,
"interview_overall_rating":xml_value,
"competency_interview_feedback":xml_value}]
Note: xml_value stands for the corresponding value from the XML. To save space, I have written xml_value in place of the actual values.
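To make the expected output concrete, here is a minimal sketch in Python (outside Hive, not using the XmlSerDe) of the values I would expect for each Interviews_group. It is only an illustration under some assumptions: the sample is trimmed to one group with a shortened Descriptor (so the unescaped `&` does not break parsing), `interviewr_id` is assumed to mean the interviewer's `Employee_ID`, and the helper names `descriptor` and `interview_groups` are made up for this example.

```python
import xml.etree.ElementTree as ET

NS = {"wd": "urn:com.workday/bsvc"}
WD = "{urn:com.workday/bsvc}"  # expanded form ElementTree uses for wd: attributes

# Trimmed sample: one Interviews_group, Descriptor shortened (assumption).
SAMPLE = """<wd:File xmlns:wd="urn:com.workday/bsvc">
<wd:Report_Entry>
<wd:Interviews_group>
<wd:Interview wd:Descriptor="Interview: Agent Smith - R-00620 Pharmacist Manager">
<wd:ID wd:type="WID">b6d03b1bf47c01e065c15380b437ef1f</wd:ID>
</wd:Interview>
<wd:Interview_Date>2019-03-15-07:00</wd:Interview_Date>
<wd:Interviewer wd:Descriptor="Chad Burke">
<wd:ID wd:type="WID">e83ebdbd2a0a01f021aff165a5e94cca</wd:ID>
<wd:ID wd:type="Employee_ID">105150473</wd:ID>
</wd:Interviewer>
<wd:Interview_Overall_Rating wd:Descriptor="3 - Highly Recommend">
<wd:ID wd:type="WID">dc2b915581810140cdfb4b12c924341f</wd:ID>
</wd:Interview_Overall_Rating>
<wd:Competency_Interview_Feedback>0</wd:Competency_Interview_Feedback>
</wd:Interviews_group>
</wd:Report_Entry>
</wd:File>"""


def descriptor(elem):
    """Return the wd:Descriptor attribute, or None if the element is missing."""
    return None if elem is None else elem.get(WD + "Descriptor")


def interview_groups(xml_text):
    """Build one dict per Interviews_group, keyed like the Hive struct fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for group in root.iterfind(".//wd:Interviews_group", NS):
        interviewer = group.find("wd:Interviewer", NS)
        employee_id = None
        if interviewer is not None:
            # Assumption: interviewr_id is the ID whose wd:type is Employee_ID.
            for id_elem in interviewer.findall("wd:ID", NS):
                if id_elem.get(WD + "type") == "Employee_ID":
                    employee_id = id_elem.text
        rows.append({
            "interview": descriptor(group.find("wd:Interview", NS)),
            "interview_date": group.findtext("wd:Interview_Date", namespaces=NS),
            "interviewer": descriptor(interviewer),
            "interviewr_id": employee_id,
            "interview_overall_rating": descriptor(
                group.find("wd:Interview_Overall_Rating", NS)),
            "competency_interview_feedback": group.findtext(
                "wd:Competency_Interview_Feedback", namespaces=NS),
        })
    return rows


if __name__ == "__main__":
    for row in interview_groups(SAMPLE):
        print(row)
```

The point of the sketch is that each struct should be built from one whole Interviews_group element, with the fields taken from its children. My current XPath for that column only selects the Descriptor attribute of Interview.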