如何使用Java在XML中解析CDATA

时间:2014-08-12 22:39:01

标签: java xml parsing dom cdata

在搜索现有的CDATA讨论后,我发现没有一个能够实现我的尝试。

是否可以在CDATA中解析标签不唯一的位置?

下面是我正在尝试检索CDATA块中每个字段的XML文档,该字段在下面的第5行中有多个感兴趣的字段(即数据加载,质量,状态,索引)。每个字段都标有" li" CDATA块中的标记(即使它是一个字符数据空间):

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.0">
<Document>
 <name>area Area Date: 2014-07-31</name>
 <Placemark><name>P07L327</name><Point><coordinates>-96.26879,85.19125</coordinates></Point><description><![CDATA[<ol><li> Data Loaded:  NO</li><li>Quality: 5</li><li>Status: UP</li><li>Index: 72</li></eol>]]></description><Style> id = "colorIcon"</Style></Placemark>
 <coordinates>-96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,45.14698,0 </coordinates>
</Document>
</kml>

目前的输出是这样的:

Name: <ol><li> Data Loaded:  NO</li><li>Quality: 5</li><li>Status: UP</li><li>Index: 72</li></eol>

从CDATA块中的WITHIN,我的意图是为每个字段输出一个新行以及它的相应结果。

以下是迄今为止编写的代码,它给出了上面列出的当前输出:

    package com.lucy.seo;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.CharacterData;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import org.w3c.dom.Element;
import java.io.File;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Comment;
import org.w3c.dom.Text;
import org.xml.sax.SAXException;


public class ReadXMLFile {

public static void main(String[] args ) throws Exception {

File fXmlFile = new File("C:/XML_UltraEdit/XML_Sandbox/Oracle_Java_Project/Test_Doc.xml");
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(fXmlFile);

doc.getDocumentElement().normalize();

System.out.println("Root element :" + doc.getDocumentElement().getNodeName());

NodeList nList = doc.getElementsByTagName("Placemark");

System.out.println("----------------------------");

for (int temp = 0; temp < nList.getLength(); temp++) {
    Element element = (Element) nList.item(temp);
            NodeList name = element.getElementsByTagName("description");
            Element line = (Element) name.item(0);
            System.out.println("Name: " + getCharacterDataFromElement(line));
    }
}
public static String getCharacterDataFromElement(Element f) {

         NodeList list = f.getChildNodes();
         String data;

         for(int index = 0; index < list.getLength(); index++){
             if(list.item(index) instanceof CharacterData){
                 CharacterData child  = (CharacterData) list.item(index);
                 data = child.getData();

                 if(data != null && data.trim().length() > 0)
                    return child.getData();
             }
         }
         return "";
}
}

感谢对此的任何帮助! - 谢谢!

2014年9月2日更新

使用最终解决方案更新了编辑。感谢大家在这里发布的解决方案和帮助。由于库冲突,解决方案被分解为两段代码/文件:

//First file which is input to the second file followed afterwards

import java.io.*;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;


public class ReadXMLFile {

public static void main(String[] args ) throws Exception {
PrintStream out = new PrintStream(new FileOutputStream("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/temp_file.html"));
System.setOut(out);
File fXmlFile = new File("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/raw_input.xml");
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(fXmlFile);


//optional, but recommended
//read this - http://stackoverflow.com/questions/13786607/normalization-in-dom-parsing-with-java-how-does-it-work
doc.getDocumentElement().normalize();

NodeList nList = doc.getElementsByTagName("Placemark");

    //create a buffered reader that connects to the console, we use it so we can read lines
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    System.out.println("<html xlmns=http://www.w3.org/1999/xhtml>");

for (int temp = 0; temp < nList.getLength(); temp++) {
                Node nNode = nList.item(temp);
                Element eElement = (Element) nNode;

    Element element = (Element) nList.item(temp);
            NodeList name = element.getElementsByTagName("description");
            Element line = (Element) name.item(0);

            System.out.println("<bracket><li>Name: " + eElement.getElementsByTagName("name").item(0).getTextContent() + "</li>");
            System.out.println("<description>Description: " + getCharacterDataFromElement(line) + "</description></bracket>");
    }
    System.out.println("</html>");

//read a line from the console
String lineFromInput = in.readLine();

//output to the file a line
out.println(lineFromInput);                                 
out.close();    
}
public static String getCharacterDataFromElement(Element f) {

         NodeList list = f.getChildNodes();
         String data;

         for(int index = 0; index < list.getLength(); index++){
             if(list.item(index) instanceof CharacterData){
                 CharacterData child  = (CharacterData) list.item(index);
                 data = child.getData();

                 if(data != null && data.trim().length() > 0)
                    return child.getData();
             }
         }
         return "";
}
}


//Second File
package ReadXMLFile_part2;

import java.io.*;

import org.jsoup.Jsoup;
import org.jsoup.select.Elements;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ReadXMLFile_part2 {

public static void main(String[] args) throws Exception {

PrintStream out = new PrintStream(new FileOutputStream("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/PA-PTH013_Output_Meters.xml"));
System.setOut(out);

System.out.println("*** JSOUP ***");

File input = new File("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/temp_file.html");
Document doc = null;
    try {
        doc = Jsoup.parse(input,"UTF-8", "http://www.w3.org/1999/xhtml" );
    } catch (IOException ex) {
        Logger.getLogger(ReadXMLFile_part2.class.getName()).log(Level.SEVERE, null, ex);
    }
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

Elements brackets = doc.getElementsByTag("bracket");

for (Element bracket : brackets) {
    Elements lis = bracket.select("li");

        for (Element li : lis){
        System.out.println(li.text());
        }
    break;
}
System.out.println();

//read a line from the console
String lineFromInput = in.readLine();

//output to the file a line
out.println(lineFromInput);                                 
out.close();    
}

}

2 个答案:

答案 0 :(得分:2)

CDATA是XML解释引擎的标记,无论它们在开始和结束之间遇到什么,都应该被视为&#34;纯粹的&#34; (原始)字符数据。

因此,在某种程度上,它就像解析器的转义字符(可以包含许多字符的转义字符)。

因此,您无法找到一个XML解析器,它会将CDATA内部的任何内容报告为XML,因为规范说它必须将其作为字符流报告。 (因此:它绝不能将其解释为XML流,这实际上很好,因为没有任何内容要求内容确实是XML)。

无论如何,您的解析器和代码正在按预期工作。

但是,如果在您的情况下,您碰巧知道某个CDATA实例的内容确实是一个有效的XML实例,那么您可以为这个精确的内容打开一个新的Parser,并适当地处理它。

因此,您可以获得getCharacterDataFromElement(line)来电的输出,将其提供给documentBuilder,并使用这个新的Document实例来解析li元素的内容

答案 1 :(得分:0)

你的问题是矛盾的,因为CDATA是解析器的一个明确指令,不解析它在CDATA中看到的内容。因此,解析内容的最简单方法是不首先包含CDATA标记。<​​/ p>

但是,告诉解析器不要解析CDATA内容,你可以做的是将内容解压缩为文本,然后将文本作为第二个解析操作提交给解析器。