Hive Xml serDe创建表

时间:2015-08-13 19:25:29

标签: xml hadoop hive

我的XML文件具有以下结构。

<records>
    <record customer_id=0001>
      <msg>
        <demographics gender=F agecat=1 edcat=1 jobcat=2 empcat=2 retire=0 jobsat=1 marital=1 spousedcat=1 residecat=4 homeown=0 hometype=2 addresscat=2/>
        <demographics gender=F agecat=3 edcat=5 jobcat=2 empcat=0 retire=0 jobsat=3 marital=2 spousedcat=1 residecat=4 homeown=0 hometype=3 addresscat=2/>
      .....
      </msg>
    </record>
</records>

我希望最终结果看起来像在Hive中

0001 F 1 1 2 2 0 1 1 1 4 0 2 2
0001 F 3 5 2 0 0 3 2 1 4 0 3 2
0001 ....
0001 ....

我尝试过以下内容。

add jar /usr/lib/hue/hivexmlserde-1.0.0.0.jar;

CREATE external TABLE pbp (
  gender string, agecat int, edcat int, jobcat int, empcat int, retire int, jobsat int, spousedcat int, homeown int, hometype int, addresscat int
  )
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.gender"="/demographics/@gender",
  "column.xpath.agecat"="/demographics/@agecat",
  "column.xpath.edcat"="/demographics/@edcat",
  "column.xpath.jobcat"="/demographics/@jobcat",
  "column.xpath.empcat"="/demographics/@empcat",
  "column.xpath.retire"="/demographics/@retire",
  "column.xpath.jobsat"="/demographics/@jobsat",
  "column.xpath.spousedcat"="/demographics/@spousedcat",
  "column.xpath.homeown"="/demographics/@homeown",
  "column.xpath.hometype"="/demographics/@hometype",
  "column.xpath.addresscat"="/demographics/@addresscat",
  )
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
  "xmlinput.start"="<demographics",
  "xmlinput.end"="/>"
);

但是,我无法弄清楚如何在创建pbp表时在另一列中插入相同的customer_id值。还请建议我正在做的是正确的方向,因为我是这个环境的新手。任何帮助表示赞赏!

1 个答案:

答案 0 :(得分:0)

我知道这很晚了,如果您仍然在寻找答案,那么我们必须将Start和End标签从更改为从“ record tag”本身而不是“ demographics tag”开始,并将数据作为字符串数组使用。我在下面修改了您的create语句,

add jar /usr/lib/hue/hivexmlserde-1.0.0.0.jar;

CREATE external TABLE pbp (
  customer_id string, gender ARRAY<string>, agecat ARRAY<int>, edcat ARRAY<int>, jobcat ARRAY<int>, empcat ARRAY<int>, retire ARRAY<int>, jobsat ARRAY<int>, spousedcat ARRAY<int>, homeown ARRAY<int>, hometype ARRAY<int>, addresscat ARRAY<int>
  )
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.customer_id"="/record/@customer_id",
  "column.xpath.gender"="/record/msg/demographics/@gender",
  "column.xpath.agecat"="/record/msg/demographics/@agecat",
  "column.xpath.edcat"="/record/msg/demographics/@edcat",
  "column.xpath.jobcat"="/record/msg/demographics/@jobcat",
  "column.xpath.empcat"="/record/msg/demographics/@empcat",
  "column.xpath.retire"="/record/msg/demographics/@retire",
  "column.xpath.jobsat"="/record/msg/demographics/@jobsat",
  "column.xpath.spousedcat"="/record/msg/demographics/@spousedcat",
  "column.xpath.homeown"="/record/msg/demographics/@homeown",
  "column.xpath.hometype"="/record/msg/demographics/@hometype",
  "column.xpath.addresscat"="/record/msg/demographics/@addresscat",
  )
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
  "xmlinput.start"="<record",
  "xmlinput.end"="/record>"
);

稍后,您必须使用POSEXPLODE数组来逐行显示数据的非规范化视图以及customer_id(假定性别属性对于所有受众特征都是必不可少的)。

CREATE TABLE IF NOT EXISTS pbp_denormalized STORED AS <FILE_FORMAT> AS
SELECT
customer_id, n.gender, agecat[pos] AS agecat, edcat[pos] AS edcat, jobcat[pos] AS jobcat, empcat[pos] AS empcat, retire[pos] AS retire, jobsat[pos] AS jobsat, spousedcat[pos] AS spousedcat, homeown[pos] AS homeown, hometype[pos] AS hometype, addresscat[pos] AS addresscat
FROM pbp
LATERAL VIEW POSEXPLODE(gender) n AS pos, gender;

上面的代码应该可以为您带来预期的结果。