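I have a Hive query that uses XPath to return a set of arrays from XML. I want to insert the elements of those arrays into a Hive table.
The XML content in the hivexml table is: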
<tag><row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" /><row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" /><row Id="3" TagName="javascript" Count="1274350" ExcerptPostId="3624960" WikiPostId="3607052" /><row Id="4" TagName="css" Count="434937" ExcerptPostId="3644670" WikiPostId="3644669" /><row Id="5" TagName="php" Count="1009113" ExcerptPostId="3624936" WikiPostId="3607050" /><row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" /></tag>
The query that returns the set of arrays is:
select xpath(str,'/tag/row/@Id'), xpath(str,'/tag/row/@TagName'), xpath(str,'/tag/row/@Count'), xpath(str,'/tag/row/@ExcerptPostId'), xpath(str,'/tag/row/@WikiPostId') from hivexml;
The output of the above query (a set of arrays) is:
["1","2","3","4","5"] [".net","html","css","php","c"] ["244006","602809","434937","1009113","236386"] ["3624959","3673183","3644670","3624936","3624961"] ["3607476","3673182","3644669","3607050","3607013"]
I want to insert these values into a Hive table like this:
1 .net 244006 3624959 3607476
2 html 602809 3673183 3673182
3 css 434937 3644670 3644669
4 php 1009113 3624936 3607050
5 c 236386 3624961 3607013
If I turn the select query above into an insert:
insert into newhivexml select xpath(str,'/tag/row/@Id'), xpath(str,'/tag/row/@TagName'), xpath(str,'/tag/row/@Count'), xpath(str,'/tag/row/@ExcerptPostId'), xpath(str,'/tag/row/@WikiPostId') from hivexml;
then I get an error:
NoMatchingMethodException No matching method for class org.apache.hadoop.hive.ql.udf.UDFToInteger with (array<string>). Possible choices: _FUNC_(bigint) _FUNC_(boolean) _FUNC_(decimal(38,18)) _FUNC_(double) _FUNC_(float) _FUNC_(smallint) _FUNC_(string) _FUNC_(struct) _FUNC_(timestamp) _FUNC_(tinyint) _FUNC_(void)
I assume we cannot insert directly like this and that I am missing something. Can anyone tell me how to do this, that is, how to insert these values from the arrays into the table?
Answer 0 (score: 2)
xpath_...(str,concat('/tag/row[',pe.pos+1,']/@...'))
create table hivexml (str string);
insert into hivexml values ('<tag><row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" /><row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" /><row Id="3" TagName="javascript" Count="1274350" ExcerptPostId="3624960" WikiPostId="3607052" /><row Id="4" TagName="css" Count="434937" ExcerptPostId="3644670" WikiPostId="3644669" /><row Id="5" TagName="php" Count="1009113" ExcerptPostId="3624936" WikiPostId="3607050" /><row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" /></tag>');
select xpath_int (str,concat('/tag/row[',pe.pos+1,']/@Id' )) as Id
,xpath_string (str,concat('/tag/row[',pe.pos+1,']/@TagName' )) as TagName
,xpath_int (str,concat('/tag/row[',pe.pos+1,']/@Count' )) as Count
,xpath_int (str,concat('/tag/row[',pe.pos+1,']/@ExcerptPostId')) as ExcerptPostId
,xpath_int (str,concat('/tag/row[',pe.pos+1,']/@WikiPostId' )) as WikiPostId
from hivexml
lateral view posexplode (xpath(str,'/tag/row/@Id')) pe
;
+----+------------+---------+---------------+------------+
| id | tagname | count | excerptpostid | wikipostid |
+----+------------+---------+---------------+------------+
| 1 | .net | 244006 | 3624959 | 3607476 |
| 2 | html | 602809 | 3673183 | 3673182 |
| 3 | javascript | 1274350 | 3624960 | 3607052 |
| 4 | css | 434937 | 3644670 | 3644669 |
| 5 | php | 1009113 | 3624936 | 3607050 |
| 8 | c | 236386 | 3624961 | 3607013 |
+----+------------+---------+---------------+------------+
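To actually load these rows into the target table, the select can be wrapped in an insert. This is a minimal sketch, not a tested statement: the question never shows the definition of newhivexml, so the schema below is an assumption (the int first column is only inferred from the UDFToInteger error).

```sql
-- Assumed target schema; the question does not show the real one.
create table if not exists newhivexml
(
    Id              int
   ,TagName         string
   ,Count           int
   ,ExcerptPostId   int
   ,WikiPostId      int
);

-- posexplode yields one row per <row> element with a 0-based position;
-- XPath positions are 1-based, hence pe.pos+1.
insert into table newhivexml
select  xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@Id'           ))
       ,xpath_string (str,concat('/tag/row[',pe.pos+1,']/@TagName'      ))
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@Count'        ))
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@ExcerptPostId'))
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@WikiPostId'   ))
from    hivexml
        lateral view posexplode (xpath(str,'/tag/row/@Id')) pe as pos, val
;
```

Because each column is now a scalar (int or string) rather than an array<string>, the implicit cast that raised NoMatchingMethodException is no longer needed.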
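Answer 1 (score: 1)
xpath(str,concat('/tag/row[',pe.pos+1,']/@*'))
This is a very neat way to extract all of an element's values together. What surprised me is that the attributes do not seem to come back in the order they appear in the XML, but in alphabetical order: @Count, @ExcerptPostId, @Id, @TagName, @WikiPostId.
Unfortunately, I cannot regard this as a legitimate solution unless I know that the alphabetical attribute order is guaranteed.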
select xpath (str,concat('/tag/row[',pe.pos+1,']/@*')) as row_values
from hivexml
lateral view posexplode (xpath(str,'/tag/row/@Id')) pe
;
-
["244006","3624959","1",".net","3607476"]
["602809","3673183","2","html","3673182"]
["1274350","3624960","3","javascript","3607052"]
["434937","3644670","4","css","3644669"]
["1009113","3624936","5","php","3607050"]
["236386","3624961","8","c","3607013"]
select row_values[2] as Id
,row_values[3] as TagName
,row_values[0] as Count
,row_values[1] as ExcerptPostId
,row_values[4] as WikiPostId
from (select xpath (str,concat('/tag/row[',pe.pos+1,']/@*')) as row_values
from hivexml
lateral view posexplode (xpath(str,'/tag/row/@Id')) pe
) x
;
+----+------------+---------+---------------+------------+
| id | tagname | count | excerptpostid | wikipostid |
+----+------------+---------+---------------+------------+
| 1 | .net | 244006 | 3624959 | 3607476 |
| 2 | html | 602809 | 3673183 | 3673182 |
| 3 | javascript | 1274350 | 3624960 | 3607052 |
| 4 | css | 434937 | 3644670 | 3644669 |
| 5 | php | 1009113 | 3624936 | 3607050 |
| 8 | c | 236386 | 3624961 | 3607013 |
+----+------------+---------+---------------+------------+
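Since the attribute order is not known to be guaranteed, one way to avoid hard-coding it blindly is to ask the XPath engine which attribute sits at each position. The query below is only a sketch: it uses the standard XPath 1.0 name() function through xpath_string, and it assumes that xpath and xpath_string enumerate attributes in the same order on your Hive build, which nothing in the thread confirms.

```sql
-- Lists the attribute name found at each position of the first <row>,
-- so the indexes used in row_values[...] can be checked rather than assumed.
select  pe.pos                                                          as attr_pos
       ,xpath_string (str,concat('name(/tag/row[1]/@*[',pe.pos+1,'])')) as attr_name
from    hivexml
        lateral view posexplode (xpath(str,'/tag/row[1]/@*')) pe as pos, val
;
```

If the names come back in a different order from the one used in the row_values[...] query, the indexes there would need to be adjusted accordingly.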
Answer 2 (score: 1)
split + str_to_map
select vals["Id"] as Id
,vals["TagName"] as TagName
,vals["Count"] as Count
,vals["ExcerptPostId"] as ExcerptPostId
,vals["WikiPostId"] as WikiPostId
from (select str_to_map(e.val,' ','=') as vals
from hivexml
lateral view posexplode(split(translate(str,'"',''),'/?><row')) e
where e.pos <> 0
) x
;
+----+------------+---------+---------------+------------+
| id | tagname | count | excerptpostid | wikipostid |
+----+------------+---------+---------------+------------+
| 1 | .net | 244006 | 3624959 | 3607476 |
| 2 | html | 602809 | 3673183 | 3673182 |
| 3 | javascript | 1274350 | 3624960 | 3607052 |
| 4 | css | 434937 | 3644670 | 3644669 |
| 5 | php | 1009113 | 3624936 | 3607050 |
| 8 | c | 236386 | 3624961 | 3607013 |
+----+------------+---------+---------------+------------+
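To see what the inner query is doing: translate strips the double quotes, split cuts the string at each row boundary, and str_to_map turns one fragment into a map keyed by attribute name. The statement below is a small illustration on a literal fragment, written by hand to resemble one piece of the split result; it is not taken from the original answer.

```sql
-- A hand-written fragment standing in for one element of the split(...) array.
-- Pairs are separated by ' ' and each key is separated from its value by '='.
select  vals["Id"]      as Id
       ,vals["TagName"] as TagName
       ,vals["Count"]   as Count
from   (select str_to_map(' Id=1 TagName=.net Count=244006 ',' ','=') as vals
       ) x
;
```

Note that this approach depends on the text layout rather than on XML parsing: an attribute value containing a space or an equals sign would be split in the wrong place, so it is best suited to data as regular as the sample.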
Answer 3 (score: 1)
If the data is an XML document
Download the XML SerDe
add jar /home/cloudera/hivexmlserde-1.0.5.3.jar;
create external table hivexml_ext
(
Id string
,TagName string
,Count string
,ExcerptPostId string
,WikiPostId string
)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties
(
"column.xpath.Id" = "/row/@Id"
,"column.xpath.TagName" = "/row/@TagName"
,"column.xpath.Count" = "/row/@Count"
,"column.xpath.ExcerptPostId" = "/row/@ExcerptPostId"
,"column.xpath.WikiPostId" = "/row/@WikiPostId"
)
stored as
inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location '/user/hive/warehouse/hivexml'
tblproperties
(
"xmlinput.start" = "<row"
,"xmlinput.end" = "/>"
)
;
select * from hivexml_ext as x
;
+------+------------+---------+-----------------+--------------+
| x.id | x.tagname | x.count | x.excerptpostid | x.wikipostid |
+------+------------+---------+-----------------+--------------+
| 1 | .net | 244006 | 3624959 | 3607476 |
| 2 | html | 602809 | 3673183 | 3673182 |
| 3 | javascript | 1274350 | 3624960 | 3607052 |
| 4 | css | 434937 | 3644670 | 3644669 |
| 5 | php | 1009113 | 3624936 | 3607050 |
| 8 | c | 236386 | 3624961 | 3607013 |
+------+------------+---------+-----------------+--------------+
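The external table exposes every attribute as a string, so loading a typed table from it needs explicit casts. A minimal sketch, assuming a target table newhivexml (Id int, TagName string, Count int, ExcerptPostId int, WikiPostId int) already exists; that schema is an assumption, since the question does not show it.

```sql
-- Assumes newhivexml already exists with the column types named above.
-- Each string column from the SerDe is cast to the assumed target type.
insert into table newhivexml
select  cast(x.Id            as int)
       ,x.TagName
       ,cast(x.Count         as int)
       ,cast(x.ExcerptPostId as int)
       ,cast(x.WikiPostId    as int)
from    hivexml_ext as x
;
```

A value that cannot be parsed as an integer becomes NULL under cast rather than raising an error, so it may be worth checking for NULLs after the load.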
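Answer 4 (score: 0)
The issue is that the XPath functions return all matches for each request in independent arrays, without joining them. If it suits you, you could use Pig as the batch-processing model, which reduces the process to a single step:
REGISTER /usr/hdp/current/pig-client/lib/piggybank.jar;
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();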
A = LOAD '/tmp/text.xml' using org.apache.pig.piggybank.storage.XMLLoader('tag') as (x:chararray);
B = FOREACH A GENERATE XPathAll(x, 'row/@Id',false,false).$0,
XPathAll(x, 'row/@TagName',false,false).$0,
XPathAll(x, 'row/@Count',false,false).$0,
XPathAll(x, 'row/@ExcerptPostId',false,false).$0,
XPathAll(x, 'row/@WikiPostId',false,false).$0;
DUMP B;
(1,.net,244006,3624959,3607476)
(2,html,602809,3673183,3673182)
(3,javascript,1274350,3624960,3607052)
(4,css,434937,3644670,3644669)
(5,php,1009113,3624936,3607050)
(8,c,236386,3624961,3607013)
STORE B INTO 'YourTable' USING org.apache.hive.hcatalog.pig.HCatStorer();