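How do I use LATERAL VIEW explode in Hive to flatten XML data?

Time: 2018-11-20 00:54:50

Tags: xml parsing hadoop hive explode

I am trying to load sales data in XML format into a Hive table. A small sample of the data is shown below.

I know I could load it into Hive by splitting the data into several tables and joining them as needed. However, I would like to know whether I can load it into a single table, with output similar to the attached screenshot.

Please help me understand what table structure I should use and how to use LATERAL VIEW explode effectively to achieve this.

Sample data: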

  <Store>
    <Version>1.1</Version>
    <StoreId>16695</StoreId>    
    <Bskt>
      <TillNo>4</TillNo>
      <BsktNo>1753</BsktNo>
      <DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
      <OpID>50056</OpID>
      <Itm>
        <ItmSeq>1</ItmSeq>
        <GTIN>29559</GTIN>
        <ItmDsc>CHOCALATE</ItmDsc>
        <ItmProm>
          <PromCD>CM</PromCD>
        </ItmProm>
      </Itm>
      <Itm>
        <ItmSeq>2</ItmSeq>
        <GTIN>59653</GTIN>
        <ItmDsc>CORN FLAKES</ItmDsc>
      </Itm>
      <Itm>
        <ItmSeq>3</ItmSeq>
        <GTIN>42260</GTIN>
        <ItmDsc> MILK CHOCOLATE 162GM</ItmDsc>
        <ItmProm>
          <PromCD>MTSRO</PromCD>
          <OfferID>11766</OfferID>
        </ItmProm>
      </Itm>
    </Bskt>
    <Bskt>
      <TillNo>5</TillNo>
      <BsktNo>1947</BsktNo>
      <DateTime>2017-10-31T16:24:59.000+11:00</DateTime>
      <OpID>50063</OpID>
      <Itm>
        <ItmSeq>1</ItmSeq>
        <GTIN>24064</GTIN>
        <ItmDsc>TOMATOES 2KG</ItmDsc>
        <ItmProm>
          <PromCD>INSTORE</PromCD>
        </ItmProm>
      </Itm>
      <Itm>
        <ItmSeq>2</ItmSeq>
        <GTIN>81287</GTIN>
        <ItmDsc>ROTHMANS BLUE</ItmDsc>
        <ItmProm>
          <PromCD>TF</PromCD>
        </ItmProm>
      </Itm>
    </Bskt>
  </Store>  

Desired output:

[Screenshot: desired output]

Table structure:

CREATE EXTERNAL TABLE IF NOT EXISTS POC_BASKET_ITEM_PROMO (
`Version` string,
`StoreId` string,
`DateTime` array<string>,
`BsktNo` array<double>,
`TillNo` array<int>,
`Item_Seq_num` array<int>,
`GTIN` array<string>,
`ItmDsc` array<string>,
`Promo_CD` array<string>,
`Offer_ID` array<int>
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.Version"="/Store/Version/text()",
"column.xpath.StoreId"="/Store/StoreId/text()",
"column.xpath.DateTime"="/Store/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Store/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Store/Bskt/TillNo/text()",
"column.xpath.Item_Seq_num"="/Store/Bskt/Itm/ItmSeq/text()",
"column.xpath.GTIN"="/Store/Bskt/Itm/GTIN/text()",
"column.xpath.ItmDsc"="/Store/Bskt/Itm/ItmDsc/text()",
"column.xpath.Promo_CD"="/Store/Bskt/Itm/ItmProm/PromCD/text()",
"column.xpath.Offer_ID"="/Store/Bskt/Itm/ItmProm/OfferID/text()"
)

STORED AS
    INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://namenode:8020/DEV/TEST/nanda_test'
TBLPROPERTIES (
    "xmlinput.start"="<Store",
    "xmlinput.end"="</Store>"
);

Output:

[Screenshot: contents of the table]

I tried to read the data with the query below, but it does not return the results in the shape I want.

select Version,StoreId,basket_dtm,basket_number,till_number from POC_BASKET_ITEM_PROMO
    LATERAL VIEW explode(DateTime) table1 as basket_dtm 
    LATERAL VIEW explode(BsktNo) table2 as basket_number
    LATERAL VIEW explode(TillNo) table3 as till_number;

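Result:

[Screenshot: query result]

2 answers:

Answer 0: (score: 0)

Exploding arrays works like a cross join. If you have 3 columns that each hold a 2-element array, applying explode to all of them gives you 2 × 2 × 2 = 8 rows.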
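As a small illustration of that row count, the sketch below explodes three literal arrays. The inline subquery and every alias in it (xs, ys, zs, x, y, z) are made up for the demonstration and are not part of the asker's table; it also assumes a Hive version that allows a SELECT without a FROM clause (0.13 or later).

```sql
-- Hypothetical one-row input: three columns, each a 2-element array.
SELECT x, y, z
FROM (
    SELECT array(1, 2)     AS xs,
           array('a', 'b') AS ys,
           array(10, 20)   AS zs
) t
LATERAL VIEW explode(xs) ex AS x   -- 2 rows
LATERAL VIEW explode(ys) ey AS y   -- each of them times 2
LATERAL VIEW explode(zs) ez AS z;  -- and times 2 again: 8 rows in total
```

Every value of x is combined with every value of y and z, which is why the basket date, number and till in the question get mixed across baskets.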

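explode on its own cannot pair an element of one array with the corresponding element of another.

You can, however, use posexplode, which also returns the index of each element, and then use those indexes as a join condition. This gets tricky when you have many columns and the arrays differ in size.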
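A minimal sketch of the posexplode approach against the table from the question might look like the following. It assumes that every basket carries exactly one DateTime, BsktNo and TillNo, so the three arrays have the same length and the same order; the position aliases (d_pos, b_pos, t_pos) are hypothetical names.

```sql
-- posexplode returns (position, value); keeping only the rows where the
-- positions agree turns the cross join back into an element-wise pairing.
SELECT Version,
       StoreId,
       basket_dtm,
       basket_number,
       till_number
FROM POC_BASKET_ITEM_PROMO
LATERAL VIEW posexplode(`DateTime`) d AS d_pos, basket_dtm
LATERAL VIEW posexplode(BsktNo)     b AS b_pos, basket_number
LATERAL VIEW posexplode(TillNo)     t AS t_pos, till_number
WHERE d_pos = b_pos
  AND d_pos = t_pos;
```

This only lines up the basket-level columns. The item-level arrays (ItmSeq, GTIN, Promo_CD, Offer_ID) are flattened across all baskets and can differ in length, so their positions no longer say which basket an item belongs to.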

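Solutions

  • If you only need to explode a few columns and the arrays are all the same size, use posexplode. That will not work in your case, so:
  • Store the XML as a complex data type: store the whole XML document as a complex type (not just arrays), that is, create a struct that mirrors your XML. This is feasible if the XML is not too complex. However, XmlSerDe is not as good as JsonSerDe at mapping a file to complex data types.

So the best solution in your case is:

  • Convert the XML to JSON. You can use NiFi or another tool for this.
  • Create a Hive table with JsonSerDe and load the file.
  • Create a view that matches your requirements.

JSON for the XML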

{"Version":"1.1","StoreId":"16695","Bskt":[{"TillNo":"4","BsktNo":"1753","DateTime":"2017-10-31T11:19:34.000+11:00","OpID":"50056","Itm":[{"ItmSeq":"1","GTIN":"29559","ItmDsc":"CHOCALATE","ItmProm":{"PromCD":"CM"}},{"ItmSeq":"2","GTIN":"59653","ItmDsc":"CORNFLAKES"},{"ItmSeq":"3","GTIN":"42260","ItmDsc":"MILKCHOCOLATE162GM","ItmProm":{"PromCD":"MTSRO","OfferID":"11766"}}]},{"TillNo":"5","BsktNo":"1947","DateTime":"2017-10-31T16:24:59.000+11:00","OpID":"50063","Itm":[{"ItmSeq":"1","GTIN":"24064","ItmDsc":"TOMATOES2KG","ItmProm":{"PromCD":"INSTORE"}},{"ItmSeq":"2","GTIN":"81287","ItmDsc":"ROTHMANSBLUE","ItmProm":{"PromCD":"TF"}}]}]}
JsonSerDe may raise errors if the file contains tabs or other whitespace, so it is always best to remove them.

Hive table

create external table temp.test_json
(
Version string,
StoreId string,
Bskt array<struct<
                    BsktNo:string,
                    DateTime:string,
                    OpID:string,
                    TillNo:string,
                    Itm:array<struct<
                                        GTIN:string,
                                        ItmDsc:string,
                                        ItmSeq:string,
                                        ItmProm:struct<
                                                        OfferID:string,
                                                        PromCD:string
                                                        >

                                    >
                            >
                >
            >
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
location '/tmp/test_json/table/';

[Screenshot: contents of the table]

Create the view:

SELECT Version,
         StoreId,
         basket.bsktno,
         basket.tillno,
         basket.`datetime`,
         item.itmseq,
         item.itmdsc,
         item.gtin,
         item.itmprom.offerid,
         item.itmprom.promcd
FROM temp.test_json 
lateral view explode(bskt) b AS basket 
lateral view explode(basket.itm) i AS item

[Screenshot: view output]
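The SELECT above is only the body of the view. One way to wrap it in an actual view is sketched below. The view name temp.v_basket_items and the column aliases are hypothetical, and LATERAL VIEW OUTER is swapped in for the plain LATERAL VIEW used above so that a basket whose Itm array is empty or NULL still produces one row, with NULLs in the item columns.

```sql
-- Hypothetical view name and aliases; source table is the one defined above.
CREATE VIEW IF NOT EXISTS temp.v_basket_items AS
SELECT Version              AS version,
       StoreId              AS store_id,
       basket.bsktno        AS basket_number,
       basket.tillno        AS till_number,
       basket.`datetime`    AS basket_dtm,
       item.itmseq          AS item_seq,
       item.itmdsc          AS item_desc,
       item.gtin            AS gtin,
       item.itmprom.offerid AS offer_id,   -- NULL when the item has no ItmProm
       item.itmprom.promcd  AS promo_cd
FROM temp.test_json
LATERAL VIEW OUTER explode(bskt)       b AS basket
LATERAL VIEW OUTER explode(basket.itm) i AS item;
```

Because the second explode is applied to basket.itm, each item stays attached to its own basket, which is what the independent array columns in the question could not provide.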

Answer 1: (score: 0)

Thanks for the detailed solution. I tested it and it works well. I also tried a similar approach that reads the data directly from XML with the XML SerDe.

My challenges:

1) Converting XML to JSON takes additional development effort, and Cloudera does not ship Apache NiFi parcels by default, so we would have to install it from custom parcels.
2) My data will definitely contain spaces and tabs, especially in the item description field, and we need to load the values exactly as we receive them. So converting to JSON and using 'org.openx.data.jsonserde.JsonSerDe' did not help: the queries failed with the errors you mentioned.

Below are the Hive table structure and the queries I used to read the data. I was able to explode the first-level array (Bskt) without any problems.

However, when I try to explode the second-level array (Itm), every field in 'Itm' comes back NULL.

Is something wrong with the query or with the table structure itself?

create external table nanda_scan_xml  (
  Version string,
  StoreId string,
  Bskt array<struct<
                    Bskt:struct<
                                DateTime:string,
                                TillNo:string,
                                BsktNo:string,
                                Itm:array<struct<
                                                Itm:struct<
                                                    ItmSeq:string,      
                                                    GTIN:string,        
                                                    ItmDsc:string,      
                                                    DeptCD:string,      
                                                    ItmCD:string,       
                                                    SalesQTY:string,        
                                                    SalesExGST:string,      
                                                    Points:string,      
                                                    CostExGST:string,       
                                                    GSTRate:string,     
                                                    DiscAmtExGST:string,        
                                                    ItmProm:struct<     
                                                                    PromCD:string,      
                                                                    OfferID:string      
                                                                  >
                                                              >
                                                     >
                                            >
                                >
                    >
            >
)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties 
(
    "column.xpath.Version"       = "/Store/Version/text()",
    "column.xpath.StoreId"       = "/Store/StoreId/text()",
    "column.xpath.Bskt"  = "/Store/Bskt"

)
stored as 
inputformat     'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
LOCATION 'hdfs://namenode/LandingArea/Sources/SCANP/IGA_SCAN/STAGING/'
tblproperties 
(
    "xmlinput.start"    = "<Store>",
    "xmlinput.end"      = "</Store>"
);

Queries:

1) For Bskt, which works fine:

SELECT  Version,
        StoreId,
        basket.Bskt.DateTime,
        basket.Bskt.bsktno,
        basket.Bskt.tillno
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket;

Result:

[Screenshot: query result]

2) When trying both lateral view explodes in a single query:

SELECT  Version,
        StoreId,
        basket.Bskt.DateTime,
        basket.Bskt.bsktno,
        basket.Bskt.tillno,
        item.Itm.ItmSeq,
        item.Itm.ItmDsc,
        item.Itm.GTIN,
        item.Itm.itmprom.OfferID,
        item.Itm.itmprom.PromCD 
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket
LATERAL VIEW EXPLODE(basket.Bskt.Itm) i AS item limit 100;

Result:

[Screenshot: query result]

3) Query:

SELECT  Version,
        StoreId,
        basket.Bskt.DateTime,
        basket.Bskt.bsktno,
        basket.Bskt.tillno,
        item.Itm.ItmSeq,
        item.Itm.ItmDsc,
        item.Itm.GTIN,
        item.Itm.itmprom.OfferID,
        item.Itm.itmprom.PromCD 
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket
LATERAL VIEW EXPLODE(basket.Itm) i AS item limit 100;

Error:

[Screenshot: error message]