Regex Java XML Pig

Time: 2013-12-19 04:31:49

Tags: java xml regex apache-pig

Please help! A few minutes of your time could save me hours!!

I am using Pig to extract some information from XML that looks like this:

<Content

<Name ><\Name> 
<Data ><\Data>
<Data ><\Data>
><\Content>

So I used:

abcd_ = LOAD 'parentFolder/*' USING org.apache.pig.piggybank.storage.XMLLoader('Content') AS (content: chararray);

I only need some specific pieces of information. I am not aware of a way to do something like this:

 abcd_ = LOAD 'parentFolder/*' USING org.apache.pig.piggybank.storage.XMLLoader('Content','Data') AS (content: chararray,data: chararray);

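In any case, I would prefer to avoid that. I have successfully extracted my other information with regular expressions applied after XMLLoader, except for the following (just one example of a possible character combination):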

<Data Name="Buffer">{&quot;$type&quot;System.Collections.Generic'[!#%,:()!@-;[.}<\Data>

My regular expressions:

1. \\<Data Name=\\"Buffer\\"\\>\\{(.*)\\}\\<\Data\\> -- Unexpected character D at <\Data>
2. \\<Data Name=\\"Buffer\\"\\>\\{(.*)\\}\\<\\Data\\> -- I got nothing
3. \<Data Name=\"Buffer\"\>\{(.*)\}\<\\Data\> -- Unexpected character < at \<Data Name..
4. \\<Data Name=\\"Buffer\\"\\>\\{(.*)\\}\\<\\\Data\\> -- Unexpected character D at <\\\Data>

What I want to get:

&quot;$type&quot;System.Collections.Generic'[!#%,:()!@-;[.

Edit:

  1. Just realized a huge mistake: \ should be /

  2. Found the answer:

    <Data Name=\\"Buffer\\">\\{(.*)\\}</Data\\>
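For readers who want to check the pattern outside Pig, here is a minimal Java sketch of the same idea. The class name `BufferRegexDemo` and the method `extractBuffer` are hypothetical, introduced only for illustration. It also assumes a lazy `(.*?)` instead of the greedy `(.*)` above, so that a record with several `Data` elements stops at the first closing tag; that is a deliberate change, not part of the original answer.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BufferRegexDemo {
    // In a Java regex only '{' and '}' need escaping here; '<', '>', '/' and '"' are literal.
    // The doubled backslashes are Java string escaping, not part of the regex itself.
    // DOTALL lets '.' match line breaks in case the payload spans lines (an assumption).
    private static final Pattern BUFFER =
            Pattern.compile("<Data Name=\"Buffer\">\\{(.*?)\\}</Data>", Pattern.DOTALL);

    // Returns the text between the braces of the first Buffer element, if there is one.
    static Optional<String> extractBuffer(String xml) {
        Matcher matcher = BUFFER.matcher(xml);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String xml = "<Content><Data Name=\"Buffer\">"
                + "{&quot;$type&quot;System.Collections.Generic'[!#%,:()!@-;[.}"
                + "</Data></Content>";
        System.out.println(extractBuffer(xml).orElse("<no match>"));
    }
}
```

Note that a regex works on the raw text, so entities such as `&quot;` are returned undecoded.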
    

1 answer:

Answer 0: (score: 0)

A better way to parse this XML is to use the XPath Java API.

The following program prints:

    XXXYYYZZZ
    111222333

import java.io.IOException;
import java.io.StringReader;
import java.util.AbstractList;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ParseXML {

    public static void main(String... args) throws Exception {
        String input = "<Content><Data Name=\"Buffer\">XXXYYYZZZ</Data><Data Name=\"Buffer\">111222333</Data></Content>";
        String xpathExpression = "//Data[@Name='Buffer']";
        NodeList result = parseXML(input, xpathExpression);
        for (Node node : new NodeListWrapper(result)) {
            System.out.println(node.getFirstChild().getTextContent());
        }
    }

    private static NodeList parseXML(String input, String xpathExpression) throws Exception {
        // Parse the XML string into a DOM document, then evaluate the XPath against it.
        Document document = createDocument(input);
        XPathFactory xpathFactory = XPathFactory.newInstance();
        XPath xpath = xpathFactory.newXPath();
        XPathExpression expression = xpath.compile(xpathExpression);
        return (NodeList) expression.evaluate(document, XPathConstants.NODESET);
    }

    private static Document createDocument(String input) throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder documentBuilder = documentFactory.newDocumentBuilder();
        return documentBuilder.parse(new InputSource(new StringReader(input)));
    }

}

class NodeListWrapper extends AbstractList<Node> {
    private final NodeList nodeList;

    public NodeListWrapper(NodeList nodeList) {
        this.nodeList = nodeList;
    }

    @Override
    public Node get(int n) {
        return nodeList.item(n);
    }

    @Override
    public int size() {
        return nodeList.getLength();
    }
}

I have uploaded the source code for this answer to GitHub.
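Applied to the payload from the question, the same XPath idea could look roughly like the sketch below. The class `BufferXPathDemo` and the method `bufferValues` are hypothetical names used only for illustration. It evaluates the expression directly against an `InputSource` rather than building the `Document` first, which is just a shorter variant of the code above and not a different technique.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class BufferXPathDemo {
    // Collects the text of every <Data Name="Buffer"> element in the given XML string.
    static List<String> bufferValues(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//Data[@Name='Buffer']",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() is safe for empty elements, unlike getFirstChild().
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (XPathExpressionException e) {
            // Malformed XML also surfaces here, wrapped by the XPath implementation.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Content><Data Name=\"Buffer\">"
                + "{&quot;$type&quot;System.Collections.Generic'[!#%,:()!@-;[.}"
                + "</Data></Content>";
        for (String value : bufferValues(xml)) {
            System.out.println(value);
        }
    }
}
```

Unlike the regex, the XML parser decodes entities, so `&quot;` comes back as a plain `"`. The surrounding braces are part of the element text and remain in the result; strip them afterwards if they are not wanted.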