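I am trying to load sales data in XML format into a Hive table. Below is a small sample of the data.
I know that I could split the data below into multiple tables and join them as needed in order to load it into Hive. However, I would like to know whether I can load it into a single table, with output similar to the attached screenshot.
Please help me understand which table structure I should use and how to use LATERAL VIEW explode effectively to achieve this.
Sample data: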
<Store>
<Version>1.1</Version>
<StoreId>16695</StoreId>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>29559</GTIN>
<ItmDsc>CHOCALATE</ItmDsc>
<ItmProm>
<PromCD>CM</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>59653</GTIN>
<ItmDsc>CORN FLAKES</ItmDsc>
</Itm>
<Itm>
<ItmSeq>3</ItmSeq>
<GTIN>42260</GTIN>
<ItmDsc> MILK CHOCOLATE 162GM</ItmDsc>
<ItmProm>
<PromCD>MTSRO</PromCD>
<OfferID>11766</OfferID>
</ItmProm>
</Itm>
</Bskt>
<Bskt>
<TillNo>5</TillNo>
<BsktNo>1947</BsktNo>
<DateTime>2017-10-31T16:24:59.000+11:00</DateTime>
<OpID>50063</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>24064</GTIN>
<ItmDsc>TOMATOES 2KG</ItmDsc>
<ItmProm>
<PromCD>INSTORE</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>81287</GTIN>
<ItmDsc>ROTHMANS BLUE</ItmDsc>
<ItmProm>
<PromCD>TF</PromCD>
</ItmProm>
</Itm>
</Bskt>
</Store>
Desired output
Table structure:
CREATE EXTERNAL TABLE IF NOT EXISTS POC_BASKET_ITEM_PROMO (
`Version` string,
`StoreId` string,
`DateTime` array<string>,
`BsktNo` array<double>,
`TillNo` array<int>,
`Item_Seq_num` array<int>,
`GTIN` array<string>,
`ItmDsc` array<string>,
`Promo_CD` array<string>,
`Offer_ID` array<int>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.Version"="/Store/Version/text()",
"column.xpath.StoreId"="/Store/StoreId/text()",
"column.xpath.DateTime"="/Store/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Store/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Store/Bskt/TillNo/text()",
"column.xpath.Item_Seq_num"="/Store/Bskt/Itm/ItmSeq/text()",
"column.xpath.GTIN"="/Store/Bskt/Itm/GTIN/text()",
"column.xpath.ItmDsc"="/Store/Bskt/Itm/ItmDsc/text()",
"column.xpath.Promo_CD"="/Store/Bskt/Itm/ItmProm/PromCD/text()",
"column.xpath.Offer_ID"="/Store/Bskt/Itm/ItmProm/OfferID/text()"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://namenode:8020/DEV/TEST/nanda_test'
TBLPROPERTIES (
"xmlinput.start"="<Store","xmlinput.end"="</Store>"
);
Output: (screenshot not included)
I tried to read the data with the query below, but it does not return the results the way I want.
select Version,StoreId,basket_dtm,basket_number,till_number from POC_BASKET_ITEM_PROMO
LATERAL VIEW explode(DateTime) table1 as basket_dtm
LATERAL VIEW explode(BsktNo) table2 as basket_number
LATERAL VIEW explode(TillNo) table3 as till_number;
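Result:
Answer 0 (score: 0)
Exploding array columns works like a cross join. So if you have 3 columns, each holding an array of 2 elements, applying explode to all of them gives you 2 x 2 x 2 = 8 rows.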
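The row count can be checked outside Hive. The following is only a Python model of the behaviour, not Hive code: the `row` dictionary and the `explode_all` helper are hypothetical names, and the values are the basket-level fields from the sample data.

```python
from itertools import product

# One table row: three array columns with two elements each
# (the basket-level values from the sample data).
row = {
    "DateTime": ["2017-10-31T11:19:34.000+11:00", "2017-10-31T16:24:59.000+11:00"],
    "BsktNo": [1753, 1947],
    "TillNo": [4, 5],
}


def explode_all(row):
    """Model one LATERAL VIEW explode per column: every combination is emitted."""
    return list(product(row["DateTime"], row["BsktNo"], row["TillNo"]))


rows = explode_all(row)
print(len(rows))  # 8, i.e. 2 * 2 * 2
for r in rows:
    print(r)
```

Only two of the eight rows describe a real basket; the other six pair a date, basket number and till number that never occurred together.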
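With plain explode you cannot map an element of one array to the matching element of another.
Actually, you can use posexplode, which gives each element an index. You can then use that index in a join condition. However, this gets tricky when you have several columns whose arrays differ in size.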
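A small Python model shows both the idea and the pitfall. It is a sketch, not Hive code: `posexplode` here is a hypothetical stand-in for the Hive function of the same name, `align_by_position` is an illustrative helper, and the item-level arrays are what the XPath expressions in the question would presumably collect from the sample.

```python
from itertools import product


def posexplode(values):
    """Model Hive's posexplode: (position, value) pairs, positions start at 0."""
    return list(enumerate(values))


def align_by_position(*columns):
    """Cross join the posexploded columns, then keep rows whose positions agree."""
    aligned = []
    for combo in product(*(posexplode(col) for col in columns)):
        if len({pos for pos, _ in combo}) == 1:
            aligned.append(tuple(value for _, value in combo))
    return aligned


# Equal-length arrays: each basket keeps its own values.
baskets = align_by_position([1753, 1947], [4, 5])
print(baskets)  # [(1753, 4), (1947, 5)]

# Unequal lengths: five items, but only four of them carry a PromCD.
item_seq = [1, 2, 3, 1, 2]
promo_cd = ["CM", "MTSRO", "INSTORE", "TF"]
items = align_by_position(item_seq, promo_cd)
print(items)  # [(1, 'CM'), (2, 'MTSRO'), (3, 'INSTORE'), (1, 'TF')]
```

In the second case the item with ItmSeq 2 in the first basket has no promotion in the XML, yet it is paired with MTSRO, every later promotion shifts by one, and the fifth item is dropped. Flat arrays lose the information about which parent each value belonged to.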
Solution
posexplode will not work in your case, so use a struct instead.
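If your XML is not too complex, this is achievable. However, xmlSerde is not as good as JSONserde at converting files into complex data types. So, for your case, the best solution is:
1) Convert the XML to JSON using NiFi or another technology.
2) Create a Hive table with JSONserde and load this file.
JSON for your XML: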
{"Version":"1.1","StoreId":"16695","Bskt":[{"TillNo":"4","BsktNo":"1753","DateTime":"2017-10-31T11:19:34.000+11:00","OpID":"50056","Itm":[{"ItmSeq":"1","GTIN":"29559","ItmDsc":"CHOCALATE","ItmProm":{"PromCD":"CM"}},{"ItmSeq":"2","GTIN":"59653","ItmDsc":"CORNFLAKES"},{"ItmSeq":"3","GTIN":"42260","ItmDsc":"MILKCHOCOLATE162GM","ItmProm":{"PromCD":"MTSRO","OfferID":"11766"}}]},{"TillNo":"5","BsktNo":"1947","DateTime":"2017-10-31T16:24:59.000+11:00","OpID":"50063","Itm":[{"ItmSeq":"1","GTIN":"24064","ItmDsc":"TOMATOES2KG","ItmProm":{"PromCD":"INSTORE"}},{"ItmSeq":"2","GTIN":"81287","ItmDsc":"ROTHMANSBLUE","ItmProm":{"PromCD":"TF"}}]}]}
JsonSerde may throw errors if the file contains tabs or other whitespace, so it is always best to remove them.
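If NiFi is not available, the conversion step could be sketched with the Python standard library. This is a minimal sketch under several assumptions: `ARRAY_TAGS`, `element_to_value` and `store_xml_to_json_line` are hypothetical names, Bskt and Itm are assumed to be the only repeating tags, and the whitespace that matters is assumed to be the line breaks and indentation between JSON tokens (the SerDe reads one JSON document per line), not the spaces inside values.

```python
import json
import xml.etree.ElementTree as ET

# Assumption: these are the only tags that repeat, so they always become arrays,
# even when a Store has one basket or a basket has one item.
ARRAY_TAGS = {"Bskt", "Itm"}


def element_to_value(elem):
    """Leaf elements become their text; parent elements become dicts."""
    children = list(elem)
    if not children:
        return elem.text or ""  # leaf text is kept exactly as received
    result = {}
    for child in children:
        value = element_to_value(child)
        if child.tag in ARRAY_TAGS:
            result.setdefault(child.tag, []).append(value)
        else:
            result[child.tag] = value
    return result


def store_xml_to_json_line(xml_text):
    """Return one compact JSON document, on a single line, for one <Store>."""
    root = ET.fromstring(xml_text)
    return json.dumps(element_to_value(root), separators=(",", ":"))


sample = """<Store>
  <Version>1.1</Version>
  <StoreId>16695</StoreId>
  <Bskt>
    <TillNo>4</TillNo>
    <BsktNo>1753</BsktNo>
    <Itm>
      <ItmSeq>2</ItmSeq>
      <GTIN>59653</GTIN>
      <ItmDsc>CORN FLAKES</ItmDsc>
    </Itm>
  </Bskt>
</Store>"""

line = store_xml_to_json_line(sample)
print(line)
```

The compact separators remove the whitespace between tokens, while the space in the item description stays inside the string value.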
Hive table
create external table temp.test_json
(
Version string,
StoreId string,
Bskt array<struct<
BsktNo:string,
DateTime:string,
OpID:string,
TillNo:string,
Itm:array<struct<
GTIN:string,
ItmDsc:string,
ItmSeq:string,
ItmProm:struct<
OfferID:string,
PromCD:string
>
>
>
>
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
location '/tmp/test_json/table/';
SELECT Version,
StoreId,
basket.bsktno,
basket.tillno,
basket.`datetime`,
item.itmseq,
item.itmdsc,
item.gtin,
item.itmprom.offerid,
item.itmprom.promcd
FROM temp.test_json
lateral view explode(bskt) b AS basket
lateral view explode(basket.itm) i AS item
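To see why the nested structure keeps items attached to the right basket, the two chained explodes can be modelled as two nested loops. This is a Python sketch of the behaviour, not Hive code: `record` is a trimmed copy of the JSON above and `flatten` is a hypothetical helper.

```python
import json

# A trimmed version of the JSON record above (two baskets, three items).
record = json.loads(
    '{"Version":"1.1","StoreId":"16695","Bskt":['
    '{"TillNo":"4","BsktNo":"1753","Itm":['
    '{"ItmSeq":"1","GTIN":"29559","ItmProm":{"PromCD":"CM"}},'
    '{"ItmSeq":"2","GTIN":"59653"}]},'
    '{"TillNo":"5","BsktNo":"1947","Itm":['
    '{"ItmSeq":"1","GTIN":"24064","ItmProm":{"PromCD":"INSTORE"}}]}]}'
)


def flatten(store):
    """One output row per item, carrying its own basket's fields."""
    rows = []
    for basket in store.get("Bskt") or []:      # models explode(bskt)
        for item in basket.get("Itm") or []:    # models explode(basket.itm)
            promo = item.get("ItmProm") or {}   # missing struct -> NULL fields
            rows.append((
                store["StoreId"], basket["BsktNo"], basket["TillNo"],
                item["ItmSeq"], item["GTIN"],
                promo.get("OfferID"), promo.get("PromCD"),
            ))
    return rows


for r in flatten(record):
    print(r)
```

Because the inner loop only ever sees the items of the current basket, an item without a promotion simply gets empty promotion fields instead of borrowing a value from a neighbouring item. A basket with no items produces no rows here, which mirrors a lateral view without the OUTER keyword.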
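Answer 1 (score: 0)
Thanks for the detailed solution. I tested it and it works well. I also tried a similar approach that reads the data directly from XML using the XML SerDe.
My challenges:
1) XML-to-JSON conversion takes additional development effort, and Cloudera does not ship Apache NiFi installation parcels by default, so we would need to install it with custom parcels.
2) My data will definitely contain spaces and tabs, especially in the 'Item description' field. We need to load the data exactly as we receive it. So converting to JSON and using 'org.openx.data.jsonserde.JsonSerDe' did not help; the queries failed with errors, as you suggested they might.
Below are the Hive table structure and the queries I used to read the data. I was able to explode the first-level array (Bskt) without any problems.
However, when I try to explode the second-level array (Itm), it returns NULL for all fields in 'Itm'.
Is there something wrong with the query or with the table structure itself?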
create external table nanda_scan_xml (
Version string,
StoreId string,
Bskt array<struct<
Bskt:struct<
DateTime:string,
TillNo:string,
BsktNo:string,
Itm:array<struct<
Itm:struct<
ItmSeq:string,
GTIN:string,
ItmDsc:string,
DeptCD:string,
ItmCD:string,
SalesQTY:string,
SalesExGST:string,
Points:string,
CostExGST:string,
GSTRate:string,
DiscAmtExGST:string,
ItmProm:struct<
PromCD:string,
OfferID:string
>
>
>
>
>
>
>
)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties
(
"column.xpath.Version" = "/Store/Version/text()",
"column.xpath.StoreId" = "/Store/StoreId/text()",
"column.xpath.Bskt" = "/Store/Bskt"
)
stored as
inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://namenode/LandingArea/Sources/SCANP/IGA_SCAN/STAGING/'
tblproperties
(
"xmlinput.start" = "<Store>",
"xmlinput.end" = "</Store>"
);
Queries:
1) For Bskt, which works fine:
SELECT Version,
StoreId,
basket.Bskt.DateTime,
basket.Bskt.bsktno,
basket.Bskt.tillno
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket;
Result: (screenshot not included)
2) When trying both LATERAL VIEW explodes in a single query:
SELECT Version,
StoreId,
basket.Bskt.DateTime,
basket.Bskt.bsktno,
basket.Bskt.tillno,
item.Itm.ItmSeq,
item.Itm.ItmDsc,
item.Itm.GTIN,
item.Itm.itmprom.OfferID,
item.Itm.itmprom.PromCD
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket
LATERAL VIEW EXPLODE(basket.Bskt.Itm) i AS item limit 100;
Result:
3) Query:
SELECT Version,
StoreId,
basket.Bskt.DateTime,
basket.Bskt.bsktno,
basket.Bskt.tillno,
item.Itm.ItmSeq,
item.Itm.ItmDsc,
item.Itm.GTIN,
item.Itm.itmprom.OfferID,
item.Itm.itmprom.PromCD
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket
LATERAL VIEW EXPLODE(basket.Itm) i AS item limit 100;
Error: