My XML file has the following structure:
<records>
<record customer_id="0001">
<msg>
<demographics gender="F" agecat="1" edcat="1" jobcat="2" empcat="2" retire="0" jobsat="1" marital="1" spousedcat="1" residecat="4" homeown="0" hometype="2" addresscat="2"/>
<demographics gender="F" agecat="3" edcat="5" jobcat="2" empcat="0" retire="0" jobsat="3" marital="2" spousedcat="1" residecat="4" homeown="0" hometype="3" addresscat="2"/>
.....
</msg>
</record>
</records>
I would like the final result in Hive to look like this:
0001 F 1 1 2 2 0 1 1 1 4 0 2 2
0001 F 3 5 2 0 0 3 2 1 4 0 3 2
0001 ....
0001 ....
I have tried the following:
add jar /usr/lib/hue/hivexmlserde-1.0.0.0.jar;
CREATE external TABLE pbp (
gender string, agecat int, edcat int, jobcat int, empcat int, retire int, jobsat int, spousedcat int, homeown int, hometype int, addresscat int
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.gender"="/demographics/@gender",
"column.xpath.agecat"="/demographics/@agecat",
"column.xpath.edcat"="/demographics/@edcat",
"column.xpath.jobcat"="/demographics/@jobcat",
"column.xpath.empcat"="/demographics/@empcat",
"column.xpath.retire"="/demographics/@retire",
"column.xpath.jobsat"="/demographics/@jobsat",
"column.xpath.spousedcat"="/demographics/@spousedcat",
"column.xpath.homeown"="/demographics/@homeown",
"column.xpath.hometype"="/demographics/@hometype",
"column.xpath.addresscat"="/demographics/@addresscat"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<demographics",
"xmlinput.end"="/>"
);
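However, I can't figure out how to insert the same customer_id value into another column when creating the pbp table. Please also tell me whether I'm heading in the right direction, since I'm new to this environment. Any help is appreciated!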
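Answer 0 (score: 0)
I know this is late, but in case you are still looking for an answer: the start and end tags have to be changed so that each row starts at the record tag itself rather than at the demographics tag, and the attribute values have to be read as arrays. I have modified your create statement below: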
add jar /usr/lib/hue/hivexmlserde-1.0.0.0.jar;
CREATE external TABLE pbp (
customer_id string, gender ARRAY<string>, agecat ARRAY<int>, edcat ARRAY<int>, jobcat ARRAY<int>, empcat ARRAY<int>, retire ARRAY<int>, jobsat ARRAY<int>, spousedcat ARRAY<int>, homeown ARRAY<int>, hometype ARRAY<int>, addresscat ARRAY<int>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/@customer_id",
"column.xpath.gender"="/record/msg/demographics/@gender",
"column.xpath.agecat"="/record/msg/demographics/@agecat",
"column.xpath.edcat"="/record/msg/demographics/@edcat",
"column.xpath.jobcat"="/record/msg/demographics/@jobcat",
"column.xpath.empcat"="/record/msg/demographics/@empcat",
"column.xpath.retire"="/record/msg/demographics/@retire",
"column.xpath.jobsat"="/record/msg/demographics/@jobsat",
"column.xpath.spousedcat"="/record/msg/demographics/@spousedcat",
"column.xpath.homeown"="/record/msg/demographics/@homeown",
"column.xpath.hometype"="/record/msg/demographics/@hometype",
"column.xpath.addresscat"="/record/msg/demographics/@addresscat"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<record",
"xmlinput.end"="</record>"
);
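To make the shape of a row in this table easier to picture, here is a minimal Python sketch that emulates it on a small sample. It is not how the SerDe is implemented; it only mimics what the XPath expressions select, assuming the attribute values in the file are quoted. The sample data is cut down to three attributes, and the `record_to_row` helper is a made-up name for illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down sample with quoted attribute values (an assumption about the real file).
SAMPLE = """<records>
<record customer_id="0001">
<msg>
<demographics gender="F" agecat="1" edcat="1"/>
<demographics gender="F" agecat="3" edcat="5"/>
</msg>
</record>
</records>"""


def record_to_row(record, attrs):
    """Build one row per <record>: the customer_id plus one list per attribute."""
    demos = record.findall("./msg/demographics")
    row = {"customer_id": record.get("customer_id")}
    for name in attrs:
        # Like an XPath attribute query, elements that lack the attribute
        # contribute nothing to the list.
        row[name] = [d.get(name) for d in demos if d.get(name) is not None]
    return row


root = ET.fromstring(SAMPLE)
rows = [record_to_row(r, ["gender", "agecat", "edcat"]) for r in root.findall("record")]
print(rows)
```

Each record becomes a single row in which customer_id is a scalar and every demographics attribute is an array, which is why a second step is needed to get one line per demographics element.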
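Afterwards, you have to POSEXPLODE the arrays to get a denormalized, row-by-row view of the data together with customer_id. The positions of the gender array are used to index all the other arrays, so this assumes the gender attribute is present on every demographics element.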
CREATE TABLE IF NOT EXISTS pbp_denormalized STORED AS <FILE_FORMAT> AS
SELECT
customer_id, n.gender, agecat[pos] AS agecat, edcat[pos] AS edcat, jobcat[pos] AS jobcat, empcat[pos] AS empcat, retire[pos] AS retire, jobsat[pos] AS jobsat, spousedcat[pos] AS spousedcat, homeown[pos] AS homeown, hometype[pos] AS hometype, addresscat[pos] AS addresscat
FROM pbp
LATERAL VIEW POSEXPLODE(gender) n AS pos, gender;
The code above should give you the expected result.
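If the position-based lookup is unclear, the following Python sketch emulates what the POSEXPLODE query does with one row of parallel arrays. It is only an illustration under stated assumptions: `posexplode_rows` is a hypothetical helper, the rows are hand-written rather than read from Hive, and `None` stands in for the NULL that Hive, as far as I know, returns for an out-of-range array index.

```python
def posexplode_rows(row, key, others):
    """Emit one tuple per element of row[key], indexing the other arrays by position."""
    out = []
    for pos, value in enumerate(row[key]):
        picked = []
        for name in others:
            arr = row[name]
            # None stands in for the NULL of an out-of-range index.
            picked.append(arr[pos] if pos < len(arr) else None)
        out.append((row["customer_id"], value, *picked))
    return out


# Every attribute present on every element: the arrays line up.
row = {"customer_id": "0001", "gender": ["F", "F"], "agecat": [1, 3], "edcat": [1, 5]}
flat = posexplode_rows(row, "gender", ["agecat", "edcat"])

# Hypothetical record where one element has no agecat: that array is shorter,
# so values can end up on the wrong line or come back empty.
short = {"customer_id": "0002", "gender": ["M", "F"], "agecat": [2], "edcat": [1, 4]}
skewed = posexplode_rows(short, "gender", ["agecat", "edcat"])

for r in flat + skewed:
    print(r)
```

The second case shows why the approach depends on every attribute being present on every demographics element: once an attribute is missing somewhere, its array no longer lines up with the positions taken from gender.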