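After searching the existing CDATA discussions, I could not find one that does what I am attempting.
Is it possible to parse inside CDATA where the tags are not unique?
Below is the XML document from which I am trying to retrieve each field inside the CDATA block. Line 5 below contains several fields of interest (i.e. Data Loaded, Quality, Status, Index). Each field is marked with an "li" tag inside the CDATA block (even though it is a character data space):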
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.0">
<Document>
<name>area Area Date: 2014-07-31</name>
<Placemark><name>P07L327</name><Point><coordinates>-96.26879,85.19125</coordinates></Point><description><![CDATA[<ol><li> Data Loaded: NO</li><li>Quality: 5</li><li>Status: UP</li><li>Index: 72</li></eol>]]></description><Style> id = "colorIcon"</Style></Placemark>
<coordinates>-96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,45.14698,0 </coordinates>
</Document>
</kml>
The current output is:
Name: <ol><li> Data Loaded: NO</li><li>Quality: 5</li><li>Status: UP</li><li>Index: 72</li></eol>
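From WITHIN the CDATA block, my intent is to output each field on a new line together with its corresponding value.
Below is the code written so far, which produces the current output listed above: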
package com.lucy.seo;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.CharacterData;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import org.w3c.dom.Element;
import java.io.File;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Comment;
import org.w3c.dom.Text;
import org.xml.sax.SAXException;
public class ReadXMLFile {
    public static void main(String[] args ) throws Exception {
        File fXmlFile = new File("C:/XML_UltraEdit/XML_Sandbox/Oracle_Java_Project/Test_Doc.xml");
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(fXmlFile);
        doc.getDocumentElement().normalize();
        System.out.println("Root element :" + doc.getDocumentElement().getNodeName());
        NodeList nList = doc.getElementsByTagName("Placemark");
        System.out.println("----------------------------");
        for (int temp = 0; temp < nList.getLength(); temp++) {
            Element element = (Element) nList.item(temp);
            NodeList name = element.getElementsByTagName("description");
            Element line = (Element) name.item(0);
            System.out.println("Name: " + getCharacterDataFromElement(line));
        }
    }
    public static String getCharacterDataFromElement(Element f) {
        NodeList list = f.getChildNodes();
        String data;
        for(int index = 0; index < list.getLength(); index++){
            if(list.item(index) instanceof CharacterData){
                CharacterData child = (CharacterData) list.item(index);
                data = child.getData();
                if(data != null && data.trim().length() > 0)
                    return child.getData();
            }
        }
        return "";
    }
}
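Any help with this is appreciated. Thanks!
Edit: updated with the final solution. Thanks to everyone for the solutions and help posted here. Because of a library conflict, the solution is split into two pieces of code/files: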
//First file which is input to the second file followed afterwards
import java.io.*;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class ReadXMLFile {
    public static void main(String[] args ) throws Exception {
        PrintStream out = new PrintStream(new FileOutputStream("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/temp_file.html"));
        System.setOut(out);
        File fXmlFile = new File("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/raw_input.xml");
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(fXmlFile);
        //optional, but recommended
        //read this - http://stackoverflow.com/questions/13786607/normalization-in-dom-parsing-with-java-how-does-it-work
        doc.getDocumentElement().normalize();
        NodeList nList = doc.getElementsByTagName("Placemark");
        //create a buffered reader that connects to the console, we use it so we can read lines
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("<html xlmns=http://www.w3.org/1999/xhtml>");
        for (int temp = 0; temp < nList.getLength(); temp++) {
            Node nNode = nList.item(temp);
            Element eElement = (Element) nNode;
            Element element = (Element) nList.item(temp);
            NodeList name = element.getElementsByTagName("description");
            Element line = (Element) name.item(0);
            System.out.println("<bracket><li>Name: " + eElement.getElementsByTagName("name").item(0).getTextContent() + "</li>");
            System.out.println("<description>Description: " + getCharacterDataFromElement(line) + "</description></bracket>");
        }
        System.out.println("</html>");
        //read a line from the console
        String lineFromInput = in.readLine();
        //output to the file a line
        out.println(lineFromInput);
        out.close();
    }
    public static String getCharacterDataFromElement(Element f) {
        NodeList list = f.getChildNodes();
        String data;
        for(int index = 0; index < list.getLength(); index++){
            if(list.item(index) instanceof CharacterData){
                CharacterData child = (CharacterData) list.item(index);
                data = child.getData();
                if(data != null && data.trim().length() > 0)
                    return child.getData();
            }
        }
        return "";
    }
}
//Second File
package ReadXMLFile_part2;
import java.io.*;
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.logging.Level;
import java.util.logging.Logger;
public class ReadXMLFile_part2 {
    public static void main(String[] args) throws Exception {
        PrintStream out = new PrintStream(new FileOutputStream("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/PA-PTH013_Output_Meters.xml"));
        System.setOut(out);
        System.out.println("*** JSOUP ***");
        File input = new File("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/temp_file.html");
        Document doc = null;
        try {
            doc = Jsoup.parse(input,"UTF-8", "http://www.w3.org/1999/xhtml" );
        } catch (IOException ex) {
            Logger.getLogger(ReadXMLFile_part2.class.getName()).log(Level.SEVERE, null, ex);
        }
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        Elements brackets = doc.getElementsByTag("bracket");
        for (Element bracket : brackets) {
            Elements lis = bracket.select("li");
            for (Element li : lis){
                System.out.println(li.text());
            }
            break;
        }
        System.out.println();
        //read a line from the console
        String lineFromInput = in.readLine();
        //output to the file a line
        out.println(lineFromInput);
        out.close();
    }
}
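Answer 0 (score: 2)
CDATA is a marker that tells XML processors that whatever they encounter between its start and end must be treated as "pure" (raw) character data.
So, in a way, it works like an escape for the parser (one that can cover many characters at once).
Consequently, you will not find an XML parser that reports anything inside CDATA as XML, because the specification says it must be reported as a character stream. (So it must never be interpreted as XML markup, which is actually fine, since nothing requires the content to be XML at all.)
In any case, your parser and your code are working as expected.
However, if you happen to know that the content of a particular CDATA instance really is a valid XML instance, you can open a new parser on exactly that content and process it accordingly.
So you can take the output of your getCharacterDataFromElement(line) call, feed it to a DocumentBuilder, and use the resulting new Document instance to read the contents of the li elements.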
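A minimal sketch of that second parse, assuming the text taken from the CDATA section is well-formed XML. The class name CdataFields and the method parseFields are hypothetical, introduced only for illustration. Note that the sample string here closes the list with </ol>; the document in the question closes it with </eol>, which a strict XML parser would reject.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original question.
public class CdataFields {

    // Parses a well-formed XML fragment (the text read from the CDATA
    // section) and returns the trimmed text of every <li> element.
    public static List<String> parseFields(String fragment) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Second parse: the CDATA text becomes its own small DOM document.
        Document inner = builder.parse(new InputSource(new StringReader(fragment)));
        NodeList items = inner.getElementsByTagName("li");
        List<String> fields = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            fields.add(items.item(i).getTextContent().trim());
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: same content as the sample, but closed with </ol>.
        String cdata = "<ol><li> Data Loaded: NO</li><li>Quality: 5</li>"
                + "<li>Status: UP</li><li>Index: 72</li></ol>";
        for (String field : parseFields(cdata)) {
            System.out.println(field); // one field per line
        }
    }
}
```

In the question's loop, the string returned by getCharacterDataFromElement(line) would be passed to parseFields in place of the hard-coded sample.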
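Answer 1 (score: 0)
Your question is contradictory, because CDATA is an explicit instruction to the parser not to parse what it sees inside the CDATA section. So the simplest way to have the content parsed is not to wrap it in CDATA in the first place.
However, given that the parser has been told not to parse the CDATA content, what you can do is extract the content as text and then submit that text to a parser as a second parse operation.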
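A second XML parse only works when the extracted text is well-formed. The CDATA in the question closes <ol> with </eol>, so a strict parser would fail on it. As a rough fallback, the sketch below swaps the second parse for a plain regular expression over the extracted text and splits each item at its first colon. This is a substitute technique, not a real parser, and it assumes simple, un-nested <li> items without attributes. The class name LenientFields and the method extract are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
public class LenientFields {

    // Matches the text between <li> and </li>; assumes no nesting or attributes.
    private static final Pattern LI = Pattern.compile("<li>(.*?)</li>", Pattern.DOTALL);

    // Returns field name -> value pairs in document order.
    public static Map<String, String> extract(String cdataText) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = LI.matcher(cdataText);
        while (m.find()) {
            // Split "Quality: 5" at the first colon only.
            String[] parts = m.group(1).split(":", 2);
            if (parts.length == 2) {
                fields.put(parts[0].trim(), parts[1].trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Text exactly as it appears in the question, including </eol>.
        String cdata = "<ol><li> Data Loaded: NO</li><li>Quality: 5</li>"
                + "<li>Status: UP</li><li>Index: 72</li></eol>";
        for (Map.Entry<String, String> e : extract(cdata).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

For anything beyond flat items like these, a lenient HTML parser such as jsoup, which the question's final solution uses, is likely the safer choice.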