After parsing the following XML,
<html>
<body>
<a>
<div>
<span>foo</span>
</div>
</a>
</body>
</html>
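the org.w3c.dom Document, queried with javax.xml.xpath, reports the following:
div is the parent of a
a is the parent of span
Why does this happen, and how can I parse this XML correctly?
Here is the code I am using, followed by the method that creates the Document object, followed by the code's output.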
String myxml = ""
        + "<html>"
        + "<body>"
        + "<a>"
        + "<div>"
        + "<span>foo</span>"
        + "</div>"
        + "</a>"
        + "</body>"
        + "</html>";
Document doc = HttpDownloadUtilities.getWebpageDocument_fromSource(myxml);
XPath xPath = XPathFactory.newInstance().newXPath();

// Find the element whose text is "foo" and print its ancestry.
Node node = ((Node) xPath.compile("//*[text() = 'foo']").evaluate(doc, XPathConstants.NODE));
System.out.println("       node tag: " + node.getNodeName());
System.out.println("     parent tag: " + node.getParentNode().getNodeName());
System.out.println("grandparent tag: " + node.getParentNode().getParentNode().getNodeName());

// Print every element together with its first child.
Set<Node> nodes = H.getSet((NodeList) xPath.compile("//*").evaluate(doc, XPathConstants.NODESET));
for (Node n : nodes) {
    System.out.println();
    try {
        System.out.println("node: " + n.getNodeName());
    } catch (Exception e) {
    }
    try {
        // item(0) is null for a node without children; the resulting exception is swallowed.
        System.out.println("child: " + n.getChildNodes().item(0).getNodeName());
    } catch (Exception e) {
    }
}
This is the method used to create the Document object:
public static Document getWebpageDocument_fromSource(String source) throws InterruptedException, IOException {
    try {
        // Note: this cleaner and its properties are configured but never used below.
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        // Note: this builder is never used either.
        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = null;
        try {
            builder = builderFactory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        }

        // The source is parsed as HTML by a fresh HtmlCleaner with default properties.
        TagNode tagNode = new HtmlCleaner().clean(source);
        Document doc = new DomSerializer(new CleanerProperties()).createDOM(tagNode);
        return doc;
    } catch (ParserConfigurationException ex) {
        ex.printStackTrace();
        return null;
    }
}
Output:
node tag: span
parent tag: a
grandparent tag: div
node: html
child: head
node: head
node: body
child: html
node: html
child: body
node: body
child: a
node: a
node: div
child: a
node: a
child: span
node: span
child: #text
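Answer 0 (score: 2)
The HTML parser has most likely repaired what it considers invalid HTML. Under HTML 4 rules, block-level div tags are not allowed inside inline a tags (HTML5 relaxes this), so the parser rearranges them. By the time you have the Document object, the HTML has already been parsed and repaired, so XPath only ever sees the repaired tree.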
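If the input really is well-formed XML, as in the question, one way to avoid the repair step is to skip the HTML parser and use the JDK's own XML parser. The following is a minimal sketch under that assumption, not a drop-in replacement for the asker's utility: the class name StrictXmlParse and the helpers parseXml and lineage are hypothetical names introduced for illustration, and only standard javax.xml and org.w3c.dom calls are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
public class StrictXmlParse {

    // Parses the string as XML with the JDK parser; no HTML clean-up is applied,
    // so the element nesting in the source is kept exactly as written.
    static Document parseXml(String source) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(source)));
        } catch (Exception e) {
            // Thrown for input that is not well-formed XML (typical for real-world HTML).
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    // Returns "node/parent/grandparent" tag names for the first node matching the XPath.
    // Assumes the expression matches an element that has at least two ancestors.
    static String lineage(String source, String xpathExpr) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xPath.compile(xpathExpr)
                    .evaluate(parseXml(source), XPathConstants.NODE);
            Node parent = node.getParentNode();
            return node.getNodeName() + "/" + parent.getNodeName()
                    + "/" + parent.getParentNode().getNodeName();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath expression", e);
        }
    }

    public static void main(String[] args) {
        String myxml = "<html><body><a><div><span>foo</span></div></a></body></html>";
        // span's parent is div and its grandparent is a, matching the source.
        System.out.println(lineage(myxml, "//*[text() = 'foo']")); // prints span/div/a
    }
}
```

The trade-off is that a strict XML parser rejects anything that is not well-formed, so this only suits input you control. For arbitrary web pages an HTML parser is still needed, and the XPath expressions then have to be written against the repaired tree.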