Parsing XML into individual rows in Pig

Time: 2015-12-09 10:01:14

Tags: apache-pig

I have the following input XML:

<data xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" generationTimestamp="2015-08-07T15:04:01.550+02:00" schemaVersion="1.7" xsi:noNamespaceSchemaLocation="http://schemas.unfccc.int/inventoryreporting/simple1_7.xsd">  
<party name="AUS"/>  
<submission uid="F928563A-471D-40FB-B1E0-022401746319" version="3" name="AUS_2015_3_Inventory"/>  
<variables>
<variable name="[Enteric Fermentation][Other Sheep.Sheep][Emissions][CH4][kt][no source][no method][no target][no option][no type]" uid="39F4C4B0-ADA2-44D0-BBC0-B99A2B917FAA" userCreated="true" type="NUMBER">
  <years>
    <year name="2011" uid="6CE3F5C5-D464-48A5-9F2F-CDE450108F5F">
      <record>
        <value>492.74734020836445</value>
        <comments/>
      </record>
    </year>
    <year name="2010" uid="F18F7BF5-8AF7-47C1-A19F-584E84D2A7A4">
      <record>
        <value>469.78235318968376</value>
        <comments/>
      </record>
    </year>
    <year name="1994" uid="943A31F2-CDDD-49E3-99BF-C9CA082EB057">
      <record>
        <value>920.00059365049015</value>
        <comments/>
      </record>
    </year>
  </years>
</variable>  
</variables>  
</data>  

I want the final output to contain one row per year, in the form (submission name, variable uid, year name, year value):

(AUS_2015_3_Inventory,39F4C4B0-ADA2-44D0-BBC0-B99A2B917FAA,2011,492.74734020836445)  
(AUS_2015_3_Inventory,39F4C4B0-ADA2-44D0-BBC0-B99A2B917FAA,2010,469.78235318968376)  
(AUS_2015_3_Inventory,39F4C4B0-ADA2-44D0-BBC0-B99A2B917FAA,1994,920.00059365049015)  

I tried the following Pig script, but it does not work as expected:

-- register piggybank jar  
register piggybank.jar;  

-- load xml  
xmldata = load 'uidXML1.xml' using org.apache.pig.piggybank.storage.XMLLoader('data') as (xmldata_content:chararray);  

-- fetch submission name, variable uid and all year tags  
common_data = foreach xmldata generate FLATTEN(REGEX_EXTRACT_ALL(xmldata_content, '[\\s*\\S*]*<submission[\\s*\\S*]*name="(.*?)"[\\s*\\S*]*/>[\\s*\\S*]*<variable[\\s*\\S*]*uid="(.*?)"[\\s*\\S*]*>\\s*<years>(.*?)</years>[\\s*\\S*]*')) as (sub_name:chararray,var_uid:chararray,years_data:chararray);  

-- split data on the basis of year  
years_split_up = foreach common_data generate sub_name, var_uid, FLATTEN(STRSPLIT(years_data,'</year>\\s*',0)) as (year_wise_xml:chararray);  

-- fetch submission name, variable uid, year name and year value  
parsed_data = foreach years_split_up generate sub_name, var_uid, FLATTEN(REGEX_EXTRACT_ALL(year_wise_xml,'\\s*<year[\\s*\\S*]*name="(.*?)"[\\s*\\S*]*>[\\s*\\S*]*<value>(.*?)</value>[\\s*\\S*]*')) as (year_name:chararray, year_value:chararray);

The output of the Pig script above is:

(AUS_2015_3_Inventory,39F4C4B0-ADA2-44D0-BBC0-B99A2B917FAA,2011,492.74734020836445)  

I get only the first year tag and not the other two. I do not know what I am doing wrong.

I cannot use the STRSPLITTOBAG function, because I am on Pig 0.12 and STRSPLITTOBAG was only introduced in Pig 0.14.

Please help.

Thanks.

Output of dump common_data:

(AUS_2015_3_Inventory,  39F4C4B0-ADA2-44D0-BBC0-B99A2B917FAA,        <year name="2011" uid="6CE3F5C5-D464-48A5-9F2F-CDE450108F5F">          <record>            <value>492.74734020836445</value>            <comments/>          </record>        <year name="2010" uid="F18F7BF5-8AF7-47C1-A19F-584E84D2A7A4">          <record>            <value>469.78235318968376</value>            <comments/>          </record>        <year name="1994" uid="943A31F2-CDDD-49E3-99BF-C9CA082EB057">          <record>            <value>920.00059365049015</value>            <comments/>          </record>)  

Output of dump years_split_up:

(AUS_2015_3_Inventory,39F4C4B0-ADA2-44D0-BBC0-B99A2B917FAA,        <year name="2011" uid="6CE3F5C5-D464-48A5-9F2F-CDE450108F5F">          <record>            <value>492.74734020836445</value>            <comments/>          </record>        </year>,        <year name="2010" uid="F18F7BF5-8AF7-47C1-A19F-584E84D2A7A4">          <record>            <value>469.78235318968376</value>            <comments/>          </record>        </year>,        <year name="1994" uid="943A31F2-CDDD-49E3-99BF-C9CA082EB057">          <record>            <value>920.00059365049015</value>            <comments/>          </record>        </year>)
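
A likely explanation for the single output row: STRSPLIT returns a tuple, and FLATTEN on a tuple spreads its elements across columns of the same row rather than producing new rows. The alias year_wise_xml then refers only to the first piece, so only 2011 is parsed. Getting one row per year requires a bag. The following is a minimal sketch of a Jython UDF that returns such a bag, assuming Jython UDFs are available in the Pig 0.12 installation. The function name split_years, the pattern YEAR_RE and the file name used in the note below are hypothetical and not part of the original script.

```python
import re

# Under Pig's Jython engine the outputSchema decorator is expected to be
# available without an import. This fallback is an assumption added only so
# that the file can also be run and tested as plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

# Matches one <year ...> element and captures its name attribute and the text
# of the first <value> that follows. \b keeps it from matching <years>.
YEAR_RE = re.compile(
    r'<year\b[^>]*\bname="([^"]*)"[^>]*>.*?<value>([^<]*)</value>',
    re.S)


@outputSchema("years:bag{t:tuple(year_name:chararray,year_value:chararray)}")
def split_years(years_xml):
    """Return one (year_name, year_value) tuple per <year> element."""
    if years_xml is None:
        return []
    return [(m.group(1), m.group(2)) for m in YEAR_RE.finditer(years_xml)]
```

Saved as, say, year_udf.py, the function could be registered with `register 'year_udf.py' using jython as yearudf;` and applied to common_data as `FLATTEN(yearudf.split_years(years_data))`, replacing both the STRSPLIT step and the second REGEX_EXTRACT_ALL. Because the UDF returns a bag, FLATTEN should then emit one row per year. The sketch assumes every year element contains a value element; a year without one would cause the lazy match to run on into the next year.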

0 answers:

No answers yet.