Please help! A few minutes of your time could save me hours!
I am using Pig to extract some information from XML of this shape:
<Content>
<Name ><\Name>
<Data ><\Data>
<Data ><\Data>
<\Content>
So I used:
abcd_ = LOAD 'parentFolder/*' USING org.apache.pig.piggybank.storage.XMLLoader('Content') AS (content: chararray);
I only need a few specific pieces of information, and I don't know whether something like this is possible:
abcd_ = LOAD 'parentFolder/*' USING org.apache.pig.piggybank.storage.XMLLoader('Content','Data') AS (content: chararray,data: chararray);
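But I would like to avoid that. Using regular expressions after XMLLoader, I have successfully extracted everything else I need, except for the following (just one example of a possible combination of characters):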
<Data Name="Buffer">{"$type"System.Collections.Generic'[!#%,:()!@-;[.}<\Data>
The regular expressions I tried:
1. \\<Data Name=\\"Buffer\\"\\>\\{(.*)\\}\\<\Data\\> -- Unexpected character D at <\Data>
2. \\<Data Name=\\"Buffer\\"\\>\\{(.*)\\}\\<\\Data\\> -- I got nothing
3. \<Data Name=\"Buffer\"\>\{(.*)\}\<\\Data\> -- Unexpected character < at \<Data Name..
4. \\<Data Name=\\"Buffer\\"\\>\\{(.*)\\}\\<\\\Data\\> -- Unexpected character D at <\\\Data>
What I want to get:
"$type"System.Collections.Generic'[!#%,:()!@-;[.
Edit:
I just realised a huge mistake: the \ should be /
Found the answer:
<Data Name=\\"Buffer\\">\\{(.*)\\}</Data\\>
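Pig's regex functions follow Java regular-expression syntax, so the pattern above can be checked in plain Java. The sketch below is only an illustration under that assumption: the class and method names (`BufferRegexDemo`, `extractBuffer`, `backslashVariantFinds`) are made up for this example, and the input is the sample line from the question with the closing tag corrected to `</Data>`. It also shows one likely reason attempt 2 returned nothing: in a Java regex `\D` means "any non-digit character", so `<\Data>` asks for `<`, one non-digit, then `ata>`, which never occurs in a closing `Data` tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BufferRegexDemo {
    // The pattern from the edit, written as a Java string literal.
    // Only the braces need escaping; '<', '>', '/' and '"' are literal in a regex.
    static final Pattern BUFFER =
            Pattern.compile("<Data Name=\"Buffer\">\\{(.*)\\}</Data>");

    // Returns the text between the outer braces, or null when nothing matches.
    static String extractBuffer(String xml) {
        Matcher m = BUFFER.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    // Checks whether the closing-tag fragment from attempt 2 is found in the text.
    static boolean backslashVariantFinds(String text) {
        // The regex here is <\Data>: '<', one non-digit character (\D), then "ata>".
        return Pattern.compile("<\\Data>").matcher(text).find();
    }

    public static void main(String[] args) {
        String sample = "<Data Name=\"Buffer\">"
                + "{\"$type\"System.Collections.Generic'[!#%,:()!@-;[.}"
                + "</Data>";
        System.out.println(extractBuffer(sample));
        System.out.println(backslashVariantFinds("</Data>")); // false
        System.out.println(backslashVariantFinds("<xata>"));  // true
    }
}
```

Because `(.*)` is greedy and `.` does not match line breaks by default, this only works as shown when the whole `Data` element sits on one line and is the last such element on that line.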
Answer 0 (score: 0)
A better way to parse this XML is to use the Java XPath API.
The code below prints:
XXXYYYZZZ
111222333
import java.io.IOException;
import java.io.StringReader;
import java.util.AbstractList;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ParseXML {

    public static void main(String... args) throws Exception {
        String input = "<Content><Data Name=\"Buffer\">XXXYYYZZZ</Data><Data Name=\"Buffer\">111222333</Data></Content>";
        // Select every Data element whose Name attribute is "Buffer".
        String xpathExpression = "//Data[@Name='Buffer']";
        NodeList result = parseXML(input, xpathExpression);
        for (Node node : new NodeListWrapper(result)) {
            System.out.println(node.getFirstChild().getTextContent());
        }
    }

    // Parses the XML string and returns the nodes matched by the XPath expression.
    private static NodeList parseXML(String input, String xpathExpression) throws Exception {
        Document document = createDocument(input);
        XPathFactory xpathFactory = XPathFactory.newInstance();
        XPath xpath = xpathFactory.newXPath();
        XPathExpression expression = xpath.compile(xpathExpression);
        return (NodeList) expression.evaluate(document, XPathConstants.NODESET);
    }

    // Builds a DOM document from the XML string.
    private static Document createDocument(String input) throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder documentBuilder = documentFactory.newDocumentBuilder();
        return documentBuilder.parse(new InputSource(new StringReader(input)));
    }
}

// Adapts a NodeList to a List<Node> so it can be used in a for-each loop.
class NodeListWrapper extends AbstractList<Node> {

    private final NodeList nodeList;

    public NodeListWrapper(NodeList nodeList) {
        this.nodeList = nodeList;
    }

    @Override
    public Node get(int n) {
        return nodeList.item(n);
    }

    @Override
    public int size() {
        return nodeList.getLength();
    }
}
I have uploaded the source code for this answer to GitHub.
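For the buffer value in the question specifically, the same XPath idea can be sketched a little more compactly. This is an illustration, not part of the answer's uploaded code: `BufferXPathDemo` and `extractBuffers` are hypothetical names, the `Content` document is assembled from the question's sample line, and stripping the outer braces is an assumption about the desired output. It also assumes the buffer text never contains a raw `<` or `&`, which would make the input invalid XML. Calling this from Pig would additionally require wrapping it in a UDF, which is not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BufferXPathDemo {
    // Returns the text of every <Data Name="Buffer"> element, without the outer braces.
    static List<String> extractBuffers(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // XPath.evaluate can parse an InputSource directly, so no DocumentBuilder is needed.
        NodeList nodes = (NodeList) xpath.evaluate(
                "//Data[@Name='Buffer']",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
        List<String> buffers = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            String text = nodes.item(i).getTextContent();
            // Assumption: the caller wants the content between the outer { and }.
            if (text.length() >= 2 && text.startsWith("{") && text.endsWith("}")) {
                text = text.substring(1, text.length() - 1);
            }
            buffers.add(text);
        }
        return buffers;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Content><Name>example</Name>"
                + "<Data Name=\"Buffer\">{\"$type\"System.Collections.Generic'[!#%,:()!@-;[.}</Data>"
                + "<Data Name=\"Other\">ignored</Data></Content>";
        for (String buffer : extractBuffers(xml)) {
            System.out.println(buffer);
        }
    }
}
```

Unlike the regex, this does not depend on how the element is laid out across lines, at the cost of requiring well-formed XML.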