I want to store XML data in a Hive table. The XML data:
<servicestatuslist>
<recordcount>1266</recordcount>
<servicestatus id="435680">
<status_text>/: 61%used(9714MB/15975MB) (<80%) : OK</status_text>
<display_name>/ Disk Usage</display_name>
<host_name>zabbix.vshodc.com</host_name>
</servicestatus>
</servicestatuslist>
I have added the JAR file to the class path:
hive> add jar /home/cloudera/HiveJars/hivexmlserde-1.0.5.1.jar ;
Added /home/cloudera/HiveJars/hivexmlserde-1.0.5.1.jar to class path
Added resource: /home/cloudera/HiveJars/hivexmlserde-1.0.5.1.jar
I wrote the following CREATE TABLE statement using the Hive XML SerDe:
create table xml_AIR(id STRING, status_text STRING,display_name STRING ,host_name STRING)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties(
"column.xpath.id"="/servicestatus/@id",
"column.xpath.status_text"="/servicestatus/status_text/text()",
"column.xpath.display_name"="/servicestatus/display_name/text()",
"column.xpath.host_name"="/servicestatus/host_name/text()"
)
stored as
inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/cloudera/input/air.xml'
tblproperties(
"xmlinput.start"="<servicestatus",
"xmlinput.end"="</servicestatus>"
);
OK
Time taken: 1.609 seconds
When I run a SELECT, it returns no rows from the table:
hive> select * from xml_AIR;
OK
Time taken: 3.0 seconds
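What is wrong with the code above? Please help.
Answer 0 (score: 5)
I ran into the same problem when working with the XML SerDe. After some effort, I fixed it by loading the data with a separate LOAD DATA statement and leaving the LOCATION clause out of the CREATE statement. Here is my XML data.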
<record customer_id="0000-JTALA">
<income>200000</income>
<demographics>
<gender>F</gender>
<agecat>1</agecat>
<edcat>1</edcat>
<jobcat>2</jobcat>
<empcat>2</empcat>
<retire>0</retire>
<jobsat>1</jobsat>
<marital>1</marital>
<spousedcat>1</spousedcat>
<residecat>4</residecat>
<homeown>0</homeown>
<hometype>2</hometype>
<addresscat>2</addresscat>
</demographics>
<financial>
<income>18</income>
<creddebt>1.003392</creddebt>
<othdebt>2.740608</othdebt>
<default>0</default>
</financial>
</record>
CREATE TABLE xml_bank6(customer_id STRING, income BIGINT, demographics map<string,string>, financial map<string,string>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/@customer_id",
"column.xpath.income"="/record/income/text()",
"column.xpath.demographics"="/record/demographics/*",
"column.xpath.financial"="/record/financial/*"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<record customer",
"xmlinput.end"="</record>"
);
OK
Time taken: 0.925 seconds
hive>
After the CREATE statement above, I used the following LOAD DATA statement to load the data from the XML file into the table created above.
hive> load data local inpath '/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml' overwrite into table xml_bank6;
Copying data from file:/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml
Copying file: file:/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml
Loading data to table default.xml_bank6
Table default.xml_bank6 stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 500, raw_data_size: 0]
OK
Time taken: 0.879 seconds
hive>
Finally,
hive> select * from xml_bank6;
OK
0000-JTALA 200000 {"empcat":"2","jobcat":"2","residecat":"4","retire":"0","hometype":"2","addresscat":"2","homeown":"0","spousedcat":"1","gender":"F","jobsat":"1","edcat":"1","marital":"1","agecat":"1"} {"default":"0","income":"18","othdebt":"2.740608","creddebt":"1.003392"}
Time taken: 0.149 seconds, Fetched: 1 row(s)
hive>
For your query, I suggest setting "xmlinput.start" to "<servicestatus id" rather than "<servicestatus", because the XML start tag follows the pattern <servicestatus id="some data">. I hope this helps.
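To illustrate why the longer start marker is the safer choice, here is a minimal shell sketch. It only checks plain substring matches against a hypothetical sample file shaped like the question's XML; it does not run Hive, and how XmlInputFormat actually handles the shorter marker is an assumption I have not verified.

```shell
# Hypothetical sample file mirroring the layout of the question's XML.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<servicestatuslist>
<recordcount>1266</recordcount>
<servicestatus id="435680">
<host_name>zabbix.vshodc.com</host_name>
</servicestatus>
</servicestatuslist>
EOF

# Count the lines that contain each candidate start marker (fixed-string match).
short_hits=$(grep -cF '<servicestatus' "$sample")
long_hits=$(grep -cF '<servicestatus id' "$sample")

echo "lines matching '<servicestatus':    $short_hits"
echo "lines matching '<servicestatus id': $long_hits"

rm -f "$sample"
```

The short marker matches two lines, because it is also a prefix of the root element <servicestatuslist>, while the longer marker matches only the actual record tag.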
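Answer 1 (score: 0)
Well, the code looks fine. According to the linked examples, it should work for you.
By the way, there is a typo in the code you provided. In the table definition, status_test STRING should be status_text STRING, or vice versa.
Answer 2 (score: 0)
The entire XML file should be on a single line (that is, with no line breaks inside the XML). A simple Unix command to get rid of the line breaks is tr '\n\r' ' ' < source.xml > processed.xml.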
https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources
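The flattening step described above can be tried out on a small file before touching real data. This is a sketch only: the temporary files and the sample record are hypothetical, and I spell out two spaces in the replacement set so the result does not depend on how a given tr implementation pads a shorter set.

```shell
# Hypothetical input with mixed Windows (\r\n) and Unix (\n) line endings.
src=$(mktemp)
out=$(mktemp)
printf '<record customer_id="0000-JTALA">\r\n<income>200000</income>\n</record>\n' > "$src"

# Replace every newline and carriage return with a space.
tr '\n\r' '  ' < "$src" > "$out"

# wc -l counts newline characters, so 0 means the output is a single line.
newlines=$(wc -l < "$out" | tr -d ' ')
echo "newline characters left: $newlines"

rm -f "$src" "$out"
```

Replacing the line breaks with spaces rather than deleting them keeps words in text nodes from being joined together.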
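Answer 3 (score: 0)
According to the Hive DDL documentation, the LOCATION clause expects an hdfs_path. So try specifying only the directory instead of the full path to the XML file. Note that if you use LOAD after CREATE TABLE instead, you cannot have an external table, which can be a useful approach in some cases.
Answer 4 (score: 0)
Give LOCATION only the directory, not the file: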
create table xml_AIR(id STRING, status_text STRING,display_name STRING ,host_name STRING)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties(
"column.xpath.id"="/servicestatus/@id",
"column.xpath.status_text"="/servicestatus/status_text/text()",
"column.xpath.display_name"="/servicestatus/display_name/text()",
"column.xpath.host_name"="/servicestatus/host_name/text()"
)
stored as
inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/cloudera/input'
tblproperties(
"xmlinput.start"="<servicestatus",
"xmlinput.end"="</servicestatus>"
);