Can someone help me figure out how to write a Pig XMLLoader for this kind of data?
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" PostTypeId="1" AcceptedAnswerId="9" CreationDate="2012-01-17T21:03:59.200" Score="30" ViewCount="698" Body="<p>From the front end, <code>\[InvisibleApplication]</code> can be entered as <kbd>Esc</kbd> <kbd>@</kbd> <kbd>Esc</kbd>, and is an invisible operator for <code>@</code>!. By an unfortunate combination of key-presses (there may have been a cat involved), this crept up in my code and I spent a great deal of time trying to figure out why in the world <code>f x</code> was being interpreted as <code>f[x]</code>. Example:</p>

<p><img src="http://i.stack.imgur.com/2Hxll.png" alt="enter image description here"></p>

<p>Now there is no way I could've spotted this visually. The <code>*Form</code>s weren't of much help either. If you're careful enough, you can see an invisible character between <code>f</code> and <code>x</code> if you move your cursor across the expression. Eventually, I found this out only by looking at the contents of the cell. </p>

<p>There's also <code>\[InvisibleSpace]</code>, <code>\[InvisibleComma]</code> and <code>\[ImplicitPlus]</code>, which are analogous to the above. There must be some use for these (perhaps internally), which is why it has been implemented in the first place. I can see the use for invisible space (lets you place superscripts/subscripts without needing anything visible to latch on to), and invisible comma (lets you use indexing like in math). It's the invisible apply that has me wondering...</p>

<p>The only advantage I can see is to sort of visually obfuscate the code. Where (or how) is this used (perhaps internally?), and can I disable it? If it's possible to disable, will there be any side effects?</p>
" OwnerUserId="5" LastEditorUserId="5" LastEditDate="2012-04-29T04:50:20.303" LastActivityDate="2013-10-22T10:48:32.560" Title="Usage of \[InvisibleApplication] and other related invisible characters" Tags="<front-end><syntax>" AnswerCount="4" CommentCount="1" FavoriteCount="4" />
<row Id="2" PostTypeId="1" AcceptedAnswerId="42" CreationDate="2012-01-17T21:10:34.680" Score="49" ViewCount="1347" Body="<p><code>Cases</code>, <code>Select</code>,<code>Pick</code> and <code>Position</code> each have different syntaxes and purposes, but there are times when you can express the same calculation equivalently using either of them. So with this input:</p>

<pre><code>test = RandomInteger[{-25, 25}, {20, 2}]

{{-15, 13}, {-8, 16}, {-8, -19}, {7, 6}, {-21, 9}, {-3, -25}, {21, -18}, {4, 4}, {2, -2}, {-24, 8}, {-17, -8}, {4, -18}, {22, -24}, {-4, -3}, {21, 0}, {19, 18}, {-23, -8}, {23, -25}, {14, -2}, {-1, -13}}
</code></pre>

<p>You can get the following equivalent results:</p>

<pre><code>Cases[test, {_, _?Positive}]

 {{-15, 13}, {-8, 16}, {7, 6}, {-21, 9}, {4, 4}, {-24, 8}, {19, 18}}

Select[test, #[[2]] &gt; 0 &amp;]

 {{-15, 13}, {-8, 16}, {7, 6}, {-21, 9}, {4, 4}, {-24, 8}, {19, 18}}

Pick[test, Sign[test[[All, 2]] ], 1]

 {{-15, 13}, {-8, 16}, {7, 6}, {-21, 9}, {4, 4}, {-24, 8}, {19, 18}}


test[[Flatten@Position[test[[All, 2]], _?Positive] ]]

 {{-15, 13}, {-8, 16}, {7, 6}, {-21, 9}, {4, 4}, {-24, 8}, {19, 18}}
</code></pre>

<p>Are there performance or other considerations that should guide which you should use? For example, is the pattern-matching used in <code>Cases</code> likely to be slower than the functional tests used in <code>Select</code>? Are there any generic rules of thumb, or is testing the particular case you are using the only solution?</p>
" OwnerUserId="8" LastEditorUserId="8" LastEditDate="2012-01-20T04:45:34.940" LastActivityDate="2012-01-20T04:45:34.940" Title="What best practices or performance considerations are there for choosing between Cases, Position, Pick and Select?" Tags="<performance-tuning><pattern-matching>" AnswerCount="4" CommentCount="0" FavoriteCount="28" />
</posts>
Answer 0 (score: 0)
To load the XML data above, you can use the following code:
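-- XMLLoader ships in piggybank; the jar path here is an assumption, adjust it for your installation
REGISTER piggybank.jar;

-- Load each <row .../> element as one chararray record
A = LOAD '$input'
    USING org.apache.pig.piggybank.storage.XMLLoader('row')
    AS (x:chararray);
B = FOREACH A GENERATE x;
DUMP B;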
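The loader returns each row as one raw string, so the attributes still have to be pulled out. A minimal sketch of one way to do that with Pig's built-in REGEX_EXTRACT follows. It assumes that piggybank.jar is at the path shown, that each record starts with `<row Id=`, and that attribute values contain no unescaped double quotes (as in a well-formed dump, where they appear as `&quot;`). The alias `posts` and the choice of Id, Score and Title are only illustrative.

```pig
-- Assumption: piggybank.jar is reachable at this path; adjust for your installation.
REGISTER piggybank.jar;

-- One record per <row .../> element, as in the answer above.
A = LOAD '$input'
    USING org.apache.pig.piggybank.storage.XMLLoader('row')
    AS (x:chararray);

-- (?s) lets '.' match newlines, because the Body attribute spans several lines.
-- The leading space in ' Score="' and ' Title="' keeps the pattern from
-- matching the tail of a longer attribute name.
-- The surrounding '.*' makes each pattern cover the whole record.
posts = FOREACH A GENERATE
    REGEX_EXTRACT(x, '(?s)<row Id="([^"]*)".*', 1)  AS id,
    REGEX_EXTRACT(x, '(?s).* Score="([^"]*)".*', 1) AS score,
    REGEX_EXTRACT(x, '(?s).* Title="([^"]*)".*', 1) AS title;

DUMP posts;
```

Because the script reads its path from `$input`, it can be run with parameter substitution, for example `pig -param input=posts.xml load_posts.pig` (both file names are hypothetical). REGEX_EXTRACT returns null when a pattern does not match, so rows missing an attribute produce a null field rather than an error.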