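How to insert data into a Hive table from arrays returned by XPath

Time: 2017-02-24 09:37:17

Tags: xml powershell hadoop xpath hive

I have a Hive query that uses XPath to return a set of arrays from XML. I want to insert the elements of those arrays into a Hive table.

The XML content in the hivexml table is: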

<tag><row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" /><row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" /><row Id="3" TagName="javascript" Count="1274350" ExcerptPostId="3624960" WikiPostId="3607052" /><row Id="4" TagName="css" Count="434937" ExcerptPostId="3644670" WikiPostId="3644669" /><row Id="5" TagName="php" Count="1009113" ExcerptPostId="3624936" WikiPostId="3607050" /><row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" /></tag>

The query that returns the set of arrays is:

select xpath(str,'/tag/row/@Id'), xpath(str,'/tag/row/@TagName'), xpath(str,'/tag/row/@Count'), xpath(str,'/tag/row/@ExcerptPostId'), xpath(str,'/tag/row/@WikiPostId') from hivexml;

The output of the above query (the set of arrays) is:

["1","2","3","4","5"] [".net","html","css","php","c"] ["244006","602809","434937","1009113","236386"] ["3624959","3673183","3644670","3624936","3624961"] ["3607476","3673182","3644669","3607050","3607013"]

I want to insert these values into a Hive table like this:

1    .net    244006     3624959    3607476
2    html    602809     3673183    3673182
3    css     434937     3644670    3644669
4    php     1009113    3624936    3607050
5    c       236386     3624961    3607013

If I put an insert in front of the select query above:

insert into newhivexml select xpath(str,'/tag/row/@Id'), xpath(str,'/tag/row/@TagName'), xpath(str,'/tag/row/@Count'), xpath(str,'/tag/row/@ExcerptPostId'), xpath(str,'/tag/row/@WikiPostId') from hivexml;

then I get an error:

NoMatchingMethodException No matching method for class org.apache.hadoop.hive.ql.udf.UDFToInteger with (array). Possible choices: _FUNC_(bigint) _FUNC_(boolean) _FUNC_(decimal(38,18)) _FUNC_(double) _FUNC_(float) _FUNC_(smallint) _FUNC_(string) _FUNC_(struct) _FUNC_(timestamp) _FUNC_(tinyint) _FUNC_(void)

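I don't think we can insert directly like this, so I must be missing something. Can anyone tell me how to do it, that is, how to insert these values from the arrays into the table?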

5 Answers:

Answer 0: (score: 2)

xpath_...(str,concat('/tag/row[',pe.pos+1,']/@...'))

create table hivexml (str string);

insert into hivexml values ('<tag><row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" /><row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" /><row Id="3" TagName="javascript" Count="1274350" ExcerptPostId="3624960" WikiPostId="3607052" /><row Id="4" TagName="css" Count="434937" ExcerptPostId="3644670" WikiPostId="3644669" /><row Id="5" TagName="php" Count="1009113" ExcerptPostId="3624936" WikiPostId="3607050" /><row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" /></tag>');
select  xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@Id'           )) as Id  
       ,xpath_string (str,concat('/tag/row[',pe.pos+1,']/@TagName'      )) as TagName
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@Count'        )) as Count
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@ExcerptPostId')) as ExcerptPostId
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@WikiPostId'   )) as WikiPostId

from    hivexml
        lateral view  posexplode (xpath(str,'/tag/row/@Id')) pe
;
+----+------------+---------+---------------+------------+
| id |  tagname   |  count  | excerptpostid | wikipostid |
+----+------------+---------+---------------+------------+
|  1 | .net       |  244006 |       3624959 |    3607476 |
|  2 | html       |  602809 |       3673183 |    3673182 |
|  3 | javascript | 1274350 |       3624960 |    3607052 |
|  4 | css        |  434937 |       3644670 |    3644669 |
|  5 | php        | 1009113 |       3624936 |    3607050 |
|  8 | c          |  236386 |       3624961 |    3607013 |
+----+------------+---------+---------------+------------+
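
To see why combining posexplode with a position predicate yields one record per row element, here is a minimal Python sketch of the same idea, offered only as an illustration outside Hive. It assumes a shortened two-row copy of the sample XML and uses the standard-library xml.etree.ElementTree module; the names XML, ATTRS, and rows_by_position are made up for this sketch and appear nowhere in the answer.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document (an assumption for this sketch).
XML = (
    '<tag>'
    '<row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" />'
    '<row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" />'
    '</tag>'
)
ATTRS = ["Id", "TagName", "Count", "ExcerptPostId", "WikiPostId"]


def rows_by_position(xml_text):
    root = ET.fromstring(xml_text)  # root is the <tag> element
    # Like xpath(str,'/tag/row/@Id'): one array entry per <row>.
    ids = [row.get("Id") for row in root.findall("row")]
    result = []
    # Like posexplode: walk the array as (pos, val) pairs, pos starting at 0.
    for pos, _ in enumerate(ids):
        # Like '/tag/row[pos+1]': XPath positions are 1-based.
        row = root.find(f"row[{pos + 1}]")
        result.append(tuple(row.get(name) for name in ATTRS))
    return result


for record in rows_by_position(XML):
    print(record)
```

The array of Id values is used only to count the rows; each attribute is then looked up again by position, which is the role the xpath_int and xpath_string calls play in the query above.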

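Answer 1: (score: 1)

xpath(str,concat('/tag/row[',pe.pos+1,']/@*'))

This is a very neat way to extract all of an element's attribute values together. What surprises me is that the attributes do not come back in the order they appear in the XML but alphabetically: @Count, @ExcerptPostId, @Id, @TagName, @WikiPostId.

Unfortunately, I can't consider it a legitimate solution unless I know that the alphabetical attribute order is guaranteed.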

select  xpath (str,concat('/tag/row[',pe.pos+1,']/@*')) as row_values

from    hivexml
        lateral view  posexplode (xpath(str,'/tag/row/@Id')) pe
;


["244006","3624959","1",".net","3607476"]
["602809","3673183","2","html","3673182"]
["1274350","3624960","3","javascript","3607052"]
["434937","3644670","4","css","3644669"]
["1009113","3624936","5","php","3607050"]
["236386","3624961","8","c","3607013"]
select  row_values[2] as Id
       ,row_values[3] as TagName
       ,row_values[0] as Count    
       ,row_values[1] as ExcerptPostId
       ,row_values[4] as WikiPostId

from   (select  xpath (str,concat('/tag/row[',pe.pos+1,']/@*')) as row_values

        from    hivexml
                lateral view  posexplode (xpath(str,'/tag/row/@Id')) pe
        ) x
;
+----+------------+---------+---------------+------------+
| id |  tagname   |  count  | excerptpostid | wikipostid |
+----+------------+---------+---------------+------------+
|  1 | .net       |  244006 |       3624959 |    3607476 |
|  2 | html       |  602809 |       3673183 |    3673182 |
|  3 | javascript | 1274350 |       3624960 |    3607052 |
|  4 | css        |  434937 |       3644670 |    3644669 |
|  5 | php        | 1009113 |       3624936 |    3607050 |
|  8 | c          |  236386 |       3624961 |    3607013 |
+----+------------+---------+---------------+------------+
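
As a quick sanity check of the hard-coded indices in the query above, the following Python sketch derives them from the attribute names. It rests entirely on the assumption, observed in this answer but not confirmed as guaranteed, that '@*' returns the values sorted by attribute name; ATTRS, index_of, and record are names invented for the illustration.

```python
# Attribute names of a <row> element in the sample data.
ATTRS = ["Id", "TagName", "Count", "ExcerptPostId", "WikiPostId"]

# Assumption: '@*' returns values sorted by attribute name, as observed above.
sorted_names = sorted(ATTRS)
index_of = {name: sorted_names.index(name) for name in ATTRS}
print(index_of)

# First array from the output above, decoded by name instead of by literal index.
row_values = ["244006", "3624959", "1", ".net", "3607476"]
record = {name: row_values[index_of[name]] for name in ATTRS}
print(record)
```

If the ordering ever differed, the indices 2, 3, 0, 1, 4 used in the query would silently map values to the wrong columns, which is why the answer stops short of recommending it.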

Answer 2: (score: 1)

split + str_to_map

select  vals["Id"]              as Id
       ,vals["TagName"]         as TagName
       ,vals["Count"]           as Count    
       ,vals["ExcerptPostId"]   as ExcerptPostId
       ,vals["WikiPostId"]      as WikiPostId

from   (select  str_to_map(e.val,' ','=') as vals

        from    hivexml 
                lateral view  posexplode(split(translate(str,'"',''),'/?><row')) e

        where   e.pos <> 0
        ) x
;
+----+------------+---------+---------------+------------+
| id |  tagname   |  count  | excerptpostid | wikipostid |
+----+------------+---------+---------------+------------+
|  1 | .net       |  244006 |       3624959 |    3607476 |
|  2 | html       |  602809 |       3673183 |    3673182 |
|  3 | javascript | 1274350 |       3624960 |    3607052 |
|  4 | css        |  434937 |       3644670 |    3644669 |
|  5 | php        | 1009113 |       3624936 |    3607050 |
|  8 | c          |  236386 |       3624961 |    3607013 |
+----+------------+---------+---------------+------------+
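
The string-based steps of the query above (strip the quotes, split at each row start, turn each piece into a key/value map) can be traced in a small Python sketch. This is only an approximation of what translate, split, and str_to_map do, written against a shortened two-row copy of the sample XML; XML and rows_as_maps are hypothetical names.

```python
import re

# Shortened copy of the sample document (an assumption for this sketch).
XML = (
    '<tag>'
    '<row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" />'
    '<row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" />'
    '</tag>'
)


def rows_as_maps(xml_text):
    # Like translate(str,'"',''): drop every double quote.
    unquoted = xml_text.replace('"', "")
    # Like split(..., '/?><row'): cut at each row start.
    # Piece 0 is '<tag', which the query filters out with e.pos <> 0.
    pieces = re.split(r"/?><row", unquoted)[1:]
    maps = []
    for piece in pieces:
        # Like str_to_map(piece, ' ', '='): space-separated key=value pairs.
        # Fragments without '=' (such as the trailing '/></tag>') are skipped here.
        pairs = (kv.split("=", 1) for kv in piece.split(" ") if "=" in kv)
        maps.append(dict(pairs))
    return maps


for row in rows_as_maps(XML):
    print(row["Id"], row["TagName"], row["Count"])
```

The sketch makes the main limitation visible: the approach treats the XML as plain text, so it only works while no attribute value contains a space, a double quote, or an equals sign.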

Answer 3: (score: 1)

If the data is an XML document

The XML SerDe can be downloaded from https://github.com/01org/graphbuilder/blob/master/src/com/intel/hadoop/graphbuilder/preprocess/inputformat/XMLInputFormat.java
add jar /home/cloudera/hivexmlserde-1.0.5.3.jar;

create external table hivexml_ext
(
    Id              string
   ,TagName         string
   ,Count           string
   ,ExcerptPostId   string
   ,WikiPostId      string
)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties 
(
    "column.xpath.Id"            = "/row/@Id"
   ,"column.xpath.TagName"       = "/row/@TagName"
   ,"column.xpath.Count"         = "/row/@Count"
   ,"column.xpath.ExcerptPostId" = "/row/@ExcerptPostId"
   ,"column.xpath.WikiPostId"    = "/row/@WikiPostId"
)
stored as
inputformat     'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat    'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location        '/user/hive/warehouse/hivexml'
tblproperties 
(
    "xmlinput.start" = "<row"
   ,"xmlinput.end"   = "/>"
)
;

select * from hivexml_ext as x
;
+------+------------+---------+-----------------+--------------+
| x.id | x.tagname  | x.count | x.excerptpostid | x.wikipostid |
+------+------------+---------+-----------------+--------------+
|    1 | .net       |  244006 |         3624959 |      3607476 |
|    2 | html       |  602809 |         3673183 |      3673182 |
|    3 | javascript | 1274350 |         3624960 |      3607052 |
|    4 | css        |  434937 |         3644670 |      3644669 |
|    5 | php        | 1009113 |         3624936 |      3607050 |
|    8 | c          |  236386 |         3624961 |      3607013 |
+------+------------+---------+-----------------+--------------+

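Answer 4: (score: 0)

The problem is that the XPath functions return all matching results for each request in independent arrays, without joining them. If it suits you, you can use Pig as the batch processing model, which reduces the process to a single step:

REGISTER /usr/hdp/current/pig-client/lib/piggybank.jar;
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();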

A = LOAD '/tmp/text.xml' using org.apache.pig.piggybank.storage.XMLLoader('tag') as (x:chararray);

B = FOREACH A GENERATE XPathAll(x, 'row/@Id',false,false).$0,
    XPathAll(x, 'row/@TagName',false,false).$0,
    XPathAll(x, 'row/@Count',false,false).$0,
    XPathAll(x, 'row/@ExcerptPostId',false,false).$0,
    XPathAll(x, 'row/@WikiPostId',false,false).$0;

DUMP B;

(1,.net,244006,3624959,3607476)
(2,html,602809,3673183,3673182)
(3,javascript,1274350,3624960,3607052)
(4,css,434937,3644670,3644669)
(5,php,1009113,3624936,3607050)
(8,c,236386,3624961,3607013)

STORE B INTO 'YourTable' USING org.apache.hive.hcatalog.pig.HCatStorer();
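
The opening point of this answer, that each XPath call returns its own array with nothing joining them, can be pictured with a short Python sketch. It is only an illustration: the three lists copy values from the output shown in the question, and ids, tag_names, counts, and rows are names chosen for the example.

```python
# Parallel arrays, as the XPath calls in the question return them
# (values copied from the question's output).
ids = ["1", "2", "3", "4", "5"]
tag_names = [".net", "html", "css", "php", "c"]
counts = ["244006", "602809", "434937", "1009113", "236386"]

# zip pairs up the i-th element of every array, giving one tuple per row.
rows = list(zip(ids, tag_names, counts))
for row in rows:
    print(row)
```

Joining by position like this only lines up when every row carries every attribute; a row with a missing attribute would shorten one array and shift all later values in it.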