I have the following XML:
<tns:TAG>
<REQUEST_ID>1</REQUEST_ID>
<APPLICATION_ID>2</APPLICATION_ID>
<EXTERNAL_SYSTEM_CODE>CF</EXTERNAL_SYSTEM_CODE>
<CCM_CHECK>
<CCM_CHECK_ID>44</CCM_CHECK_ID>
<CCM_CHECK_RESULT>21</CCM_CHECK_RESULT>
</CCM_CHECK>
</tns:TAG>
If I remove the tns: prefix from it, I can create a Hive table that reads it correctly. But if I leave the prefix in, I get the following error:
java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 42; The prefix "tns" for element "tns:WSCCMVerifyApplicationResultRequest" is not bound.
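The only thing I can think of is to pre-process the file and strip all of these tns: prefixes from the elements. I imagine something like regexp_replace() would do that. But my question is: is there another way that works with the table as I currently create it?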
Answer 0 (score: 0)
Declare the namespace in your file to get rid of this error. Something like this:
<tns:TAG xmlns:tns="http://localhost">
<REQUEST_ID>1</REQUEST_ID>
<APPLICATION_ID>2</APPLICATION_ID>
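To see why the declaration matters, here is a minimal Python sketch (not part of the Hive setup) that uses the standard-library parser to show an unbound prefix being rejected and the same document being accepted once `xmlns:tns` is added. The `bind_prefix` and `parses` helpers are hypothetical names for illustration, and `http://localhost` is only a placeholder URI. Since the table below splits records at `<tns:TAG`, the declaration presumably needs to sit on each record's opening tag.

```python
import re
import xml.etree.ElementTree as ET

# Placeholder namespace URI (assumption): any URI works as long as the prefix is bound.
TNS_URI = "http://localhost"

unbound = "<tns:TAG><REQUEST_ID>1</REQUEST_ID></tns:TAG>"


def bind_prefix(xml_text, prefix="tns", uri=TNS_URI):
    """Add xmlns:<prefix> to the first <prefix:...> opening tag if it is missing."""
    if f"xmlns:{prefix}=" in xml_text:
        return xml_text
    pattern = rf"<{re.escape(prefix)}:([\w.-]+)"
    replacement = rf'<{prefix}:\1 xmlns:{prefix}="{uri}"'
    return re.sub(pattern, replacement, xml_text, count=1)


def parses(xml_text):
    """Return True if a namespace-aware parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


bound = bind_prefix(unbound)
print(parses(unbound))  # the unbound "tns" prefix makes the parser reject the input
print(parses(bound))    # with the declaration added, the same content parses
print(bound)
```

This keeps the tns: prefix in the data instead of deleting it, so the element names stay as they were delivered.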
Update:
CREATE EXTERNAL TABLE myxml(
request_id string
, application_id string
, external_system_code string
, ccm_check map<string, string>
, verif_answers array<map<string, string>>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.request_id"="//*[local-name()='REQUEST_ID']/text()",
"column.xpath.application_id"="//*[local-name()='APPLICATION_ID']/text()",
"column.xpath.external_system_code"="//*[local-name()='EXTERNAL_SYSTEM_CODE']/text()",
"column.xpath.ccm_check"="//*[local-name()='CCM_CHECK']/*",
"column.xpath.verif_answers"="//*[local-name()='VERIF_ANSWERS']/*"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 'file:///home/cloudera/xmlfiles'
TBLPROPERTIES (
"xmlinput.start"="<tns:TAG",
"xmlinput.end"="</tns:TAG>"
);