Question

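I'm using Java to read an RSS feed from a URL, parse it into a DOM tree with javax.xml.parsers.DocumentBuilder.parse(InputStream), make some changes to it, and then serialize and output the result with org.w3c.dom.ls.LSSerializer.write(Node,LSOutput).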

The feed I'm reading is http://www.collaborationblueprint.com.au/blog/rss.xml.

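The feed is well-formed XML, but the serialized result is not.
Every attempt so far has dropped one pair of square brackets, breaking the CDATA sections. For example, if the source contains the following element: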

    <description><![CDATA[<p>some text</p>]]></description>

the serialized result looks like this, which is not well-formed:

    <description><![CDATA<p>some text</p>]></description>

My code is below. It runs inside a Lotus Domino agent. How can I fix this?

import java.io.InputStream;
import java.io.PrintWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLDecoder;
import java.util.HashMap;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

import lotus.domino.*;

public class JavaAgent extends AgentBase {
    public void NotesMain() {
        try {
            org.w3c.dom.Document newDoc;
            DocumentBuilderFactory builderFactory;
            DocumentBuilder builder;
            Element docElem,tmpElem;
            Node tmpNode;

            Session session=getSession();
            AgentContext agentContext=session.getAgentContext();

            // Put URL arguments into a HashMap.
            Document doc=agentContext.getDocumentContext();
            String[] query=doc.getItemValueString("Query_String").split("&");

            HashMap<String,String> queryMap=new HashMap<String,String>(query.length);
            for (int i=0; i<query.length; i++) {
                int j=query[i].indexOf('=');
                if (j<0) queryMap.put(query[i],"");
                else queryMap.put(query[i].substring(0,j),URLDecoder.decode(query[i].substring(j+1),"UTF-8"));
            }

            // Get the "src" URL argument - this is the URL we're reading the feed from.
            String urlStr=queryMap.get("src");
            if (urlStr==null || urlStr.length()==0) {
                System.err.println("Error: source URL not specified.");
                return;
            }
            URL url;
            try {
                url=new URL(urlStr);
            } catch (Exception e) {
                System.err.println("Error: invalid source URL.");
                return;
            }

            HttpURLConnection conn=(HttpURLConnection)url.openConnection();
            InputStream is=conn.getInputStream();

            // Create a DocumentBuilder and parse the XML.
            builderFactory=DocumentBuilderFactory.newInstance();
            builder=builderFactory.newDocumentBuilder();
            try {
                newDoc=builder.parse(is);
                is.close();
                conn.disconnect();
            } catch (Exception e) {
                is.close();
                conn.disconnect();
                System.err.println("XML parse exception: "+e.toString());
                return;
            }

            docElem=newDoc.getDocumentElement();
            docElem.setAttribute("xmlns:ibmwcm","http://purl.org/net/ibmfeedsvc/wcm/1.0");

            PrintWriter pw=getAgentOutput();
            pw.println("Content-type: text/xml");

            DOMImplementationRegistry registry=DOMImplementationRegistry
                .newInstance();
            DOMImplementationLS impl=(DOMImplementationLS)registry
                .getDOMImplementation("LS");
            LSOutput lso=impl.createLSOutput();
            lso.setCharacterStream(pw);
            LSSerializer writer=impl.createLSSerializer();
            writer.write(newDoc,lso);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
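
The parse-and-serialize steps in the code above can be exercised on their own, outside Domino, to see whether LSSerializer itself damages CDATA sections. The following is a minimal sketch, not part of the original agent: the class name `CdataRoundTrip` and the method `roundTrip` are made up for illustration, and it assumes the default JAXP parser and DOM LS implementation that ship with the JDK.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical helper for checking the serializer in isolation.
final class CdataRoundTrip {

    // Parses the XML string into a DOM tree and serializes it again
    // with LSSerializer, using the same calls as the agent.
    static String roundTrip(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            DOMImplementationRegistry registry =
                DOMImplementationRegistry.newInstance();
            DOMImplementationLS impl =
                (DOMImplementationLS) registry.getDOMImplementation("LS");

            // Write to an in-memory buffer instead of the agent's PrintWriter.
            StringWriter buffer = new StringWriter();
            LSOutput out = impl.createLSOutput();
            out.setCharacterStream(buffer);

            LSSerializer serializer = impl.createLSSerializer();
            serializer.write(doc, out);
            return buffer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<description><![CDATA[<p>some text</p>]]></description>";
        System.out.println(roundTrip(xml));
    }
}
```

If the CDATA section comes through intact here, the serializer is probably not what strips the brackets, and the output path (the agent's PrintWriter or Domino itself) is the next thing to look at.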

Answer 1

I have established that the problem has nothing to do with the serializer.

Even if I do something as simple as this:

    pw.print("<description><![CDATA[<p>Some text</p>]]></description>");

the inner square brackets are still stripped.

However, if I encode the less-than signs inside the CDATA section, the problem goes away. For example:

    pw.print("<description><![CDATA[&lt;p>Some text&lt;/p>]]></description>");

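I haven't yet determined whether the cause is the PrintWriter class or Lotus Domino, but either way I should be able to fix it by modifying the XML between parsing and serialization.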
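
One possible way to do this, sketched below, is to have the parser turn CDATA sections into ordinary text nodes with `DocumentBuilderFactory.setCoalescing(true)`. The serializer then has no CDATA section to write and escapes the `<` characters as `&lt;`, so no `[<` sequence appears in the output. The class name `CoalescedFeed` and its methods are hypothetical, and whether this actually avoids the bracket stripping inside Domino is an assumption that has not been tested there.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical helper: parse with CDATA sections merged into text nodes.
final class CoalescedFeed {

    // Parses XML so that CDATA sections become plain Text nodes.
    static Document parseCoalesced(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(true); // convert CDATA nodes to Text nodes
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Serializes the document to a string with LSSerializer.
    static String serialize(Document doc) {
        try {
            DOMImplementationLS impl = (DOMImplementationLS)
                DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
            StringWriter buffer = new StringWriter();
            LSOutput out = impl.createLSOutput();
            out.setCharacterStream(buffer);
            LSSerializer serializer = impl.createLSSerializer();
            serializer.write(doc, out);
            return buffer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialize failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<description><![CDATA[<p>some text</p>]]></description>";
        // The markup inside <description> is written as escaped text.
        System.out.println(serialize(parseCoalesced(xml)));
    }
}
```

In the agent, the equivalent change would be a single `builderFactory.setCoalescing(true);` call before `newDocumentBuilder()`. The text content of each element is unchanged for any XML consumer; only its representation (escaped text instead of a CDATA section) differs.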

How can I use LSSerializer while keeping the CDATA sections well-formed?

1 answer: