Pig XMLLoader code

Time: 2015-05-11 06:11:37

Tags: apache-pig

Can someone help me write a Pig XMLLoader script for data like the following?

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="9" CreationDate="2012-01-17T21:03:59.200" Score="30" ViewCount="698" Body="&lt;p&gt;From the front end, &lt;code&gt;\[InvisibleApplication]&lt;/code&gt; can be entered as &lt;kbd&gt;Esc&lt;/kbd&gt; &lt;kbd&gt;@&lt;/kbd&gt; &lt;kbd&gt;Esc&lt;/kbd&gt;, and is an invisible operator for &lt;code&gt;@&lt;/code&gt;!. By an unfortunate combination of key-presses (there may have been a cat involved), this crept up in my code and I spent a great deal of time trying to figure out why in the world &lt;code&gt;f x&lt;/code&gt; was being interpreted as &lt;code&gt;f[x]&lt;/code&gt;. Example:&lt;/p&gt;&#xA;&#xA;&lt;p&gt;&lt;img src=&quot;http://i.stack.imgur.com/2Hxll.png&quot; alt=&quot;enter image description here&quot;&gt;&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Now there is no way I could've spotted this visually. The &lt;code&gt;*Form&lt;/code&gt;s weren't of much help either. If you're careful enough, you can see an invisible character between &lt;code&gt;f&lt;/code&gt; and &lt;code&gt;x&lt;/code&gt; if you move your cursor across the expression. Eventually, I found this out only by looking at the contents of the cell. &lt;/p&gt;&#xA;&#xA;&lt;p&gt;There's also &lt;code&gt;\[InvisibleSpace]&lt;/code&gt;, &lt;code&gt;\[InvisibleComma]&lt;/code&gt; and &lt;code&gt;\[ImplicitPlus]&lt;/code&gt;, which are analogous to the above. There must be some use for these (perhaps internally), which is why it has been implemented in the first place. I can see the use for invisible space (lets you place superscripts/subscripts without needing anything visible to latch on to), and invisible comma (lets you use indexing like in math). It's the invisible apply that has me wondering...&lt;/p&gt;&#xA;&#xA;&lt;p&gt;The only advantage I can see is to sort of visually obfuscate the code. Where (or how) is this used (perhaps internally?), and can I disable it? 
If it's possible to disable, will there be any side effects?&lt;/p&gt;&#xA;" OwnerUserId="5" LastEditorUserId="5" LastEditDate="2012-04-29T04:50:20.303" LastActivityDate="2013-10-22T10:48:32.560" Title="Usage of \[InvisibleApplication] and other related invisible characters" Tags="&lt;front-end&gt;&lt;syntax&gt;" AnswerCount="4" CommentCount="1" FavoriteCount="4" />
  <row Id="2" PostTypeId="1" AcceptedAnswerId="42" CreationDate="2012-01-17T21:10:34.680" Score="49" ViewCount="1347" Body="&lt;p&gt;&lt;code&gt;Cases&lt;/code&gt;, &lt;code&gt;Select&lt;/code&gt;,&lt;code&gt;Pick&lt;/code&gt; and &lt;code&gt;Position&lt;/code&gt; each have different syntaxes and purposes, but there are times when you can express the same calculation equivalently using either of them. So with this input:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;test = RandomInteger[{-25, 25}, {20, 2}]&#xA;&#xA;{{-15, 13}, {-8, 16}, {-8, -19}, {7, 6}, {-21, 9}, {-3, -25}, {21, -18}, {4, 4}, {2, -2}, {-24,  8}, {-17, -8}, {4, -18}, {22, -24}, {-4, -3}, {21, 0}, {19,    18}, {-23, -8}, {23, -25}, {14, -2}, {-1, -13}}&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;You can get the following equivalent results:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;Cases[test, {_, _?Positive}]&#xA;&#xA; {{-15, 13}, {-8, 16}, {7, 6}, {-21, 9}, {4, 4}, {-24, 8}, {19, 18}}&#xA;&#xA;Select[test, #[[2]] &amp;gt; 0 &amp;amp;]&#xA;&#xA; {{-15, 13}, {-8, 16}, {7, 6}, {-21, 9}, {4, 4}, {-24, 8}, {19, 18}}&#xA;&#xA;Pick[test, Sign[test[[All, 2]] ], 1]&#xA;&#xA; {{-15, 13}, {-8, 16}, {7, 6}, {-21, 9}, {4, 4}, {-24, 8}, {19, 18}}&#xA;&#xA;&#xA;test[[Flatten@Position[test[[All, 2]], _?Positive] ]]&#xA;&#xA; {{-15, 13}, {-8, 16}, {7, 6}, {-21, 9}, {4, 4}, {-24, 8}, {19, 18}}&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;Are there performance or other considerations that should guide which you should use? For example, is the pattern-matching used in &lt;code&gt;Cases&lt;/code&gt; likely to be slower than the functional tests used in &lt;code&gt;Select&lt;/code&gt;? 
Are there any generic rules of thumb, or is testing the particular case you are using the only solution?&lt;/p&gt;&#xA;" OwnerUserId="8" LastEditorUserId="8" LastEditDate="2012-01-20T04:45:34.940" LastActivityDate="2012-01-20T04:45:34.940" Title="What best practices or performance considerations are there for choosing between Cases, Position, Pick and Select?" Tags="&lt;performance-tuning&gt;&lt;pattern-matching&gt;" AnswerCount="4" CommentCount="0" FavoriteCount="28" />
</posts>

1 answer:

Answer 0: (score: 0)

To load the XML data above, use the following code. XMLLoader('row') returns each row element as a single chararray:

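-- Make XMLLoader available (the jar path is an assumption; adjust it to your installation)
REGISTER piggybank.jar;

-- Load every <row> element as one chararray record
A = LOAD '$input' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS (x:chararray);
B = FOREACH A GENERATE x;
DUMP B;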
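
Since the loader only returns the raw element text, the individual attributes still have to be pulled out of it. A minimal sketch of one way to do that with Pig's built-in REGEX_EXTRACT follows. It assumes XMLLoader returns each row element, including its attributes, as one string; the aliases xml_rows and posts, the choice of attributes, and the piggybank.jar path are illustrative and not part of the original answer.

```pig
-- Assumption: piggybank.jar is in the working directory; adjust the path as needed.
REGISTER piggybank.jar;

-- Load each <row> element as one chararray, as in the answer above.
xml_rows = LOAD '$input'
           USING org.apache.pig.piggybank.storage.XMLLoader('row')
           AS (x:chararray);

-- Extract a few attributes from the raw element text.
-- The leading space in each pattern stops ' Id=' from matching inside
-- 'PostTypeId=', 'AcceptedAnswerId=' or 'OwnerUserId='.
posts = FOREACH xml_rows GENERATE
          (int) REGEX_EXTRACT(x, ' Id="([^"]*)"', 1)    AS id,
          (int) REGEX_EXTRACT(x, ' Score="([^"]*)"', 1) AS score,
          REGEX_EXTRACT(x, ' Title="([^"]*)"', 1)       AS title;

DUMP posts;
```

The extracted values are still XML-escaped, so an attribute such as Body would keep entities like &lt; and &#xA; until they are decoded in a later step.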