Retrieving null from an array of structs in XML using Hive XML SerDe

Time: 2019-03-20 19:18:39

Tags: xml xpath hive xml-parsing xmlserde

I have this XML file:

<wd:File xmlns:wd="urn:com.workday/bsvc">
<wd:Report_Entry>
<wd:Job_Application wd:Descriptor="Agent Smith - R-00620 Pharmacist Manager - Sam's -R2_RHO_H&amp;W_008 (C-00549)">
<wd:ID wd:type="WID">35fe6a9de198812f1f1af872e801de13</wd:ID>
<wd:ID wd:type="Job_Application_ID">JOB_APPLICATION-6-882</wd:ID>
</wd:Job_Application>
<wd:Interviews_group>
<wd:Interview wd:Descriptor="Interview: Agent Smith - R-00620 Pharmacist Manager - Sam's -R2_RHO_H&amp;W_008 (C-00549)">
<wd:ID wd:type="WID">b6d03b1bf47c01e065c15380b437ef1f</wd:ID>
</wd:Interview>
<wd:Interview_Date>2019-03-15-07:00</wd:Interview_Date>
<wd:Interviewer wd:Descriptor="Chad Burke">
<wd:ID wd:type="WID">e83ebdbd2a0a01f021aff165a5e94cca</wd:ID>
<wd:ID wd:type="Employee_ID">105150473</wd:ID>
</wd:Interviewer>
<wd:Interview_Overall_Rating wd:Descriptor="3 - Highly Recommend">
<wd:ID wd:type="WID">dc2b915581810140cdfb4b12c924341f</wd:ID>
</wd:Interview_Overall_Rating>
<wd:Competency_Interview_Feedback>0</wd:Competency_Interview_Feedback>
</wd:Interviews_group>
<wd:Interviews_group>
<wd:Interview wd:Descriptor="Interview: Agent Smith - R-00620 Pharmacist Manager - Sam's -R2_RHO_H&amp;W_008 (C-00549)">
<wd:ID wd:type="WID">b6d03b1bf47c01e065c15380b437ef1f</wd:ID>
</wd:Interview>
<wd:Interview_Date>2019-03-15-07:00</wd:Interview_Date>
<wd:Interviewer wd:Descriptor="ADAM COOPER">
<wd:ID wd:type="WID">2b4ab86a080101f93043d3d28913c123</wd:ID>
<wd:ID wd:type="Employee_ID">225027082</wd:ID>
</wd:Interviewer>
<wd:Interview_Overall_Rating wd:Descriptor="3 - Highly Recommend">
<wd:ID wd:type="WID">dc2b915581810140cdfb4b12c924341f</wd:ID>
</wd:Interview_Overall_Rating>
<wd:Competency_Interview_Feedback>0</wd:Competency_Interview_Feedback>
</wd:Interviews_group>
</wd:Report_Entry>
</wd:File>

Here Interviews_group is an array of structs, each containing several tags. I have to load this file into Hive, so I am using the Hive XmlSerDe. My Hive query is as follows:

Hive Query:
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.applications_test
    (Job_Application string,
    Job_Application_ID string,
    Interviews_group array<struct<Interview:string,Interview_date:string,Interviewer:String,Interviewr_ID:String,Interview_Overall_Rating:string,Competency_Interview_Feedback:string>>
    )
    ROW FORMAT SERDE 'com.walmart.spss.hive.serde2.xml.XmlSerDe'
    WITH SERDEPROPERTIES("column.xpath.Job_Application"="//*[local-name()='Job_Application']/@*[local-name()='Descriptor']",
    "column.xpath.Job_Application_ID"="//*[local-name()='Job_Application']/*[local-name()='ID'][contains(@*[local-name()='type'],'Job_Application_ID')]/text()",
    "column.xpath.Interviews_group"="//*[local-name()='Interview']/@*[local-name()='Descriptor']"

    )
    STORED AS
    INPUTFORMAT 'com.walmart.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION '/user/hive/hr/highsecure/raw/hiring/wd/job_applications/'
    TBLPROPERTIES (
    "xmlinput.start"="<wd:Report_Entry>",
    "xmlinput.end"="</wd:Report_Entry>",
    "xmlinput.nsstart"="<wd:File xmlns:wd='urn:com.workday/bsvc'>",
    "xmlinput.nsend"="</wd:File>"
    );

Unfortunately, the Interviews_group column comes back with nulls instead of being populated with the values from the XML.

Output:
    Interviews_group   

         [{"interview":null,
        "interview_date":null,
        "interviewer":null,
        "interviewr_id":null,
        "interview_overall_rating":null,
        "competency_interview_feedback":null}]

I get the correct values for columns 1 and 2, but not for column 3 (Interviews_group).
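To make the difference between the columns easier to see, here is a small sketch outside Hive, in Python with the standard library. It uses a trimmed copy of the sample (descriptors and IDs shortened, which is my simplification) and a hypothetical helper `qname`. It only shows what the third XPath selects; whether this is what makes the SerDe return nulls is an assumption, not something confirmed here.

```python
import xml.etree.ElementTree as ET

WD = "urn:com.workday/bsvc"  # namespace URI taken from the sample file

# Trimmed copy of the sample: descriptors and IDs are shortened for brevity.
SAMPLE = """\
<wd:File xmlns:wd="urn:com.workday/bsvc">
  <wd:Report_Entry>
    <wd:Interviews_group>
      <wd:Interview wd:Descriptor="Interview: Agent Smith">
        <wd:ID wd:type="WID">b6d03b1b</wd:ID>
      </wd:Interview>
      <wd:Interview_Date>2019-03-15-07:00</wd:Interview_Date>
      <wd:Interviewer wd:Descriptor="Chad Burke">
        <wd:ID wd:type="Employee_ID">105150473</wd:ID>
      </wd:Interviewer>
    </wd:Interviews_group>
    <wd:Interviews_group>
      <wd:Interview wd:Descriptor="Interview: Agent Smith">
        <wd:ID wd:type="WID">b6d03b1b</wd:ID>
      </wd:Interview>
      <wd:Interview_Date>2019-03-15-07:00</wd:Interview_Date>
      <wd:Interviewer wd:Descriptor="ADAM COOPER">
        <wd:ID wd:type="Employee_ID">225027082</wd:ID>
      </wd:Interviewer>
    </wd:Interviews_group>
  </wd:Report_Entry>
</wd:File>
"""


def qname(local):
    """Return the namespace-qualified name ElementTree uses for a wd: name."""
    return "{%s}%s" % (WD, local)


root = ET.fromstring(SAMPLE)

# Rough equivalent of //*[local-name()='Interview']/@*[local-name()='Descriptor']:
# one plain string per Interview element, with no child fields attached.
descriptors = [el.get(qname("Descriptor")) for el in root.iter(qname("Interview"))]

# The elements that actually hold every struct field are the Interviews_group nodes.
groups = list(root.iter(qname("Interviews_group")))
child_names = [[child.tag.split("}", 1)[1] for child in g] for g in groups]

print(descriptors)
print(child_names)
```

So the expression for column 3 yields a list of attribute strings, while the column is declared as an array of structs with six fields. Note also that the struct declares Interview_date and Interviewr_ID, whereas the XML has Interview_Date and no element called Interviewr_ID.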

How can I get Interviews_group populated?  Expected output:

         [{"interview":xml_value,
        "interview_date":xml_value,
        "interviewer":xml_value,
        "interviewr_id":xml_value,
        "interview_overall_rating":xml_value,
        "competency_interview_feedback":xml_value}]

Note: xml_value stands for the corresponding value from the XML. I used the placeholder xml_value only to save space.
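To spell out the placeholders, the following Python sketch builds the expected array from a trimmed copy of the sample, again outside Hive. The field mapping is an assumption: Descriptor attributes for interview, interviewer and interview_overall_rating, element text for the date and feedback, and the Employee_ID for interviewr_id. The helpers `qname`, `descriptor`, `text`, `employee_id` and `group_to_struct` are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

WD = "urn:com.workday/bsvc"  # namespace URI taken from the sample file

# Trimmed copy of one Report_Entry: descriptors and WIDs are shortened.
SAMPLE = """\
<wd:Report_Entry xmlns:wd="urn:com.workday/bsvc">
  <wd:Interviews_group>
    <wd:Interview wd:Descriptor="Interview: Agent Smith"/>
    <wd:Interview_Date>2019-03-15-07:00</wd:Interview_Date>
    <wd:Interviewer wd:Descriptor="Chad Burke">
      <wd:ID wd:type="WID">e83ebdbd</wd:ID>
      <wd:ID wd:type="Employee_ID">105150473</wd:ID>
    </wd:Interviewer>
    <wd:Interview_Overall_Rating wd:Descriptor="3 - Highly Recommend"/>
    <wd:Competency_Interview_Feedback>0</wd:Competency_Interview_Feedback>
  </wd:Interviews_group>
  <wd:Interviews_group>
    <wd:Interview wd:Descriptor="Interview: Agent Smith"/>
    <wd:Interview_Date>2019-03-15-07:00</wd:Interview_Date>
    <wd:Interviewer wd:Descriptor="ADAM COOPER">
      <wd:ID wd:type="WID">2b4ab86a</wd:ID>
      <wd:ID wd:type="Employee_ID">225027082</wd:ID>
    </wd:Interviewer>
    <wd:Interview_Overall_Rating wd:Descriptor="3 - Highly Recommend"/>
    <wd:Competency_Interview_Feedback>0</wd:Competency_Interview_Feedback>
  </wd:Interviews_group>
</wd:Report_Entry>
"""


def qname(local):
    """Return the namespace-qualified name ElementTree uses for a wd: name."""
    return "{%s}%s" % (WD, local)


def descriptor(parent, local):
    """Descriptor attribute of the named direct child, or None if absent."""
    el = parent.find(qname(local))
    return None if el is None else el.get(qname("Descriptor"))


def text(parent, local):
    """Text content of the named direct child, or None if absent."""
    el = parent.find(qname(local))
    return None if el is None else el.text


def employee_id(group):
    """Text of the Interviewer ID whose wd:type is Employee_ID, or None."""
    interviewer = group.find(qname("Interviewer"))
    if interviewer is None:
        return None
    for id_el in interviewer.findall(qname("ID")):
        if id_el.get(qname("type")) == "Employee_ID":
            return id_el.text
    return None


def group_to_struct(group):
    """Map one Interviews_group element to the struct shape from the question."""
    return {
        "interview": descriptor(group, "Interview"),
        "interview_date": text(group, "Interview_Date"),
        "interviewer": descriptor(group, "Interviewer"),
        "interviewr_id": employee_id(group),
        "interview_overall_rating": descriptor(group, "Interview_Overall_Rating"),
        "competency_interview_feedback": text(group, "Competency_Interview_Feedback"),
    }


root = ET.fromstring(SAMPLE)
structs = [group_to_struct(g) for g in root.iter(qname("Interviews_group"))]
print(structs)
```

Each Interviews_group element becomes one struct, so the sample file should give an array with two entries per Report_Entry.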

0 answers:

No answers