Processing XML files with Apache Pig

Date: 2013-08-20 08:04:26

Tags: xml-parsing apache-pig

I have an XML file like this:

<CATALOG>
<CD>
<TITLE>hadoop developer</TITLE>
<ARTIST>ajay</ARTIST>
<COUNTRY>india</COUNTRY>
<COMPANY>ITC</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>2013</YEAR>
</CD>
</CATALOG>

I tried a regular expression, but I am not getting the expected output, and I don't know why. My code is as follows:

REGISTER /usr/lib/pig/piggybank.jar;

A = load 'input.xml' using org.apache.pig.piggybank.storage.XMLLoader('CATALOG') as (x: chararray);
B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<CATALOG>\n*<CD>\n<TITLE>(.*)</TITLE>\n*<ARTIST>(.*)</ARTIST>\n*<COUNTRY>(.*)</COUNTRY>\n*<COMPANY>(.*)</COMPANY>\n*<PRICE>(.*)</PRICE>\n*<YEAR>(.*)</YEAR>\n*</CD>\\n*</CATALOG>')) as (name:chararray, words:chararray);

My output is as follows:

2013-08-20 12:40:24,043 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

2013-08-20 12:40:24,044 [main] WARN
org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized

2013-08-20 12:40:24,047 [main] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

2013-08-20 12:40:24,047 [main] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

What is wrong with it? Thanks.

3 Answers:

Answer 0 (score: 1)

Try this; it has been tested and works correctly:

Create an XML folder under /user/hue/ and copy catalog.xml (your XML data) into that folder.
REGISTER piggybank.jar ;

xmldata = LOAD 'XML/catalog.xml' USING org.apache.pig.piggybank.storage.XMLLoader('CD') as(doc:chararray);

data = FOREACH xmldata GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<CD>\\s*<TITLE>(.*)</TITLE>\\s*<ARTIST>(.*)</ARTIST>\\s*<COUNTRY>(.*)</COUNTRY>\\s*<COMPANY>(.*)</COMPANY>\\s*<PRICE>(.*)</PRICE>\\s*<YEAR>(.*)</YEAR>\\s*</CD>')) AS (title:chararray, artist:chararray, country:chararray, company:chararray, price:chararray, year:chararray);

DESCRIBE data;

dump data;
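
The pattern in this answer separates the tags with \\s* where the question used \n*. As a rough illustration of why that can matter, here is a minimal Python sketch. It uses Python's re module as a stand-in for the Java regex engine that Pig relies on, and it assumes that REGEX_EXTRACT_ALL only returns groups when the whole record matches (hence fullmatch). The indented sample record and the build_pattern helper are hypothetical, not part of the original post.

```python
import re

# Hypothetical sample: one <CD> record whose child elements are indented,
# roughly as XMLLoader('CD') might pass it to the regex.
record = (
    "<CD>\n"
    "  <TITLE>hadoop developer</TITLE>\n"
    "  <ARTIST>ajay</ARTIST>\n"
    "  <COUNTRY>india</COUNTRY>\n"
    "  <COMPANY>ITC</COMPANY>\n"
    "  <PRICE>10.90</PRICE>\n"
    "  <YEAR>2013</YEAR>\n"
    "</CD>"
)

TAGS = ["TITLE", "ARTIST", "COUNTRY", "COMPANY", "PRICE", "YEAR"]


def build_pattern(separator):
    """Build a <CD> pattern with the given regex fragment between tags."""
    body = separator.join(f"<{t}>(.*)</{t}>" for t in TAGS)
    return re.compile(f"<CD>{separator}{body}{separator}</CD>")


# \n* accepts only newlines between tags; \s* accepts any whitespace.
newline_only = build_pattern(r"\n*")
any_whitespace = build_pattern(r"\s*")

strict_match = newline_only.fullmatch(record)
loose_match = any_whitespace.fullmatch(record)
fields = loose_match.groups() if loose_match else None

print(strict_match)
print(fields)
```

With \n* the indentation before each tag is left unmatched, so the whole match fails; \s* absorbs spaces, tabs, and carriage returns as well as newlines.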

Answer 1 (score: 0)

How about this:

A = load 'input.xml' using org.apache.pig.piggybank.storage.XMLLoader('CD') 
    as (x:chararray);

B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x, 
      '<CD>\\n\\s*<TITLE>(.*)</TITLE>\\n\\s*<ARTIST>(.*)</ARTIST>\\n\\s*<COUNTRY>(.*)</COUNTRY>\\n\\s*<COMPANY>(.*)</COMPANY>\\n\\s*<PRICE>(.*)</PRICE>\\n\\s*<YEAR>(.*)</YEAR>\\n\\s*</CD>')) 
    as (title:chararray, artist:chararray, country:chararray, company:chararray, price:double, year:int);
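
Two details of this answer are worth seeing in isolation: the separator \\n\\s* requires at least one newline between elements, and the AS clause declares price as double and year as int even though the regex captures plain strings. The Python sketch below mimics both under stated assumptions: Python's re stands in for the Java regex engine, fullmatch approximates whole-record matching, and parse_cd is a hypothetical helper that does the string-to-number conversion explicitly.

```python
import re

TAGS = ["TITLE", "ARTIST", "COUNTRY", "COMPANY", "PRICE", "YEAR"]
SEP = r"\n\s*"  # a newline, then optional further whitespace

CD_PATTERN = re.compile(
    "<CD>" + SEP
    + SEP.join(f"<{t}>(.*)</{t}>" for t in TAGS)
    + SEP + "</CD>"
)

# The <CD> element from the question, one child element per line.
record = (
    "<CD>\n"
    "<TITLE>hadoop developer</TITLE>\n"
    "<ARTIST>ajay</ARTIST>\n"
    "<COUNTRY>india</COUNTRY>\n"
    "<COMPANY>ITC</COMPANY>\n"
    "<PRICE>10.90</PRICE>\n"
    "<YEAR>2013</YEAR>\n"
    "</CD>"
)


def parse_cd(text):
    """Return a typed dict for one <CD> record, or None if it does not match."""
    match = CD_PATTERN.fullmatch(text)
    if match is None:
        return None
    title, artist, country, company, price, year = match.groups()
    return {
        "title": title,
        "artist": artist,
        "country": country,
        "company": company,
        "price": float(price),  # corresponds to price:double
        "year": int(year),      # corresponds to year:int
    }


parsed = parse_cd(record)
# The same record squeezed onto one line has no newlines for \n to match.
single_line = parse_cd(record.replace("\n", ""))

print(parsed)
print(single_line)
```

If the input might ever place several elements on one line, \\s* alone is the more forgiving separator.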

Answer 2 (score: 0)

This should work.

A =  LOAD 'xml-files/cd.xml' using org.apache.pig.piggybank.storage.XMLLoader('CD') as (x:chararray);

B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<CD>\\s*<TITLE>(.*)</TITLE>\\s*<ARTIST>(.*)</ARTIST>\\s*<COUNTRY>(.*)</COUNTRY>\\s*<COMPANY>(.*)</COMPANY>\\s*<PRICE>(.*)</PRICE>\\s*<YEAR>(.*)</YEAR>\\s*</CD>'));
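
Loading with XMLLoader('CD') means the regex sees one <CD> element per record, so a catalog with several CDs yields several rows. The following Python sketch imitates that flow outside Pig. It is only an approximation: the CD_BLOCK regex is a crude stand-in for XMLLoader's element splitting, Python's re stands in for Java's regex engine, and the second CD entry is made-up sample data.

```python
import re

# The first <CD> is from the question; the second is hypothetical sample data.
catalog = """<CATALOG>
<CD>
<TITLE>hadoop developer</TITLE>
<ARTIST>ajay</ARTIST>
<COUNTRY>india</COUNTRY>
<COMPANY>ITC</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>2013</YEAR>
</CD>
<CD>
<TITLE>pig basics</TITLE>
<ARTIST>someone</ARTIST>
<COUNTRY>india</COUNTRY>
<COMPANY>ACME</COMPANY>
<PRICE>8.50</PRICE>
<YEAR>2012</YEAR>
</CD>
</CATALOG>"""

# Crude stand-in for XMLLoader('CD'): one string per <CD>...</CD> element.
CD_BLOCK = re.compile(r"<CD>.*?</CD>", re.DOTALL)

# Same shape as the pattern in the answer, with \s* between tags.
CD_FIELDS = re.compile(
    r"<CD>\s*<TITLE>(.*)</TITLE>\s*<ARTIST>(.*)</ARTIST>"
    r"\s*<COUNTRY>(.*)</COUNTRY>\s*<COMPANY>(.*)</COMPANY>"
    r"\s*<PRICE>(.*)</PRICE>\s*<YEAR>(.*)</YEAR>\s*</CD>"
)


def extract_rows(xml_text):
    """Split the text into <CD> records and extract the six fields from each."""
    rows = []
    for block in CD_BLOCK.findall(xml_text):
        match = CD_FIELDS.fullmatch(block)
        if match:
            rows.append(match.groups())
    return rows


rows = extract_rows(catalog)
for row in rows:
    print(row)
```

Matching per <CD> element keeps the pattern independent of how many records the catalog contains, which a single pattern anchored on <CATALOG> cannot do.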