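I'm using Java to read an RSS feed from a URL, parse it into a DOM tree with javax.xml.parsers.DocumentBuilder.parse(InputStream),
make some changes to it, and then serialize and output the result with org.w3c.dom.ls.LSSerializer.write(Node,LSOutput).
The feed I'm reading is http://www.collaborationblueprint.com.au/blog/rss.xml.
The feed is well-formed XML, but the serialized result is not.
Every attempt so far has dropped a pair of square brackets, which breaks the CDATA sections.
For example, if the source contains this element: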
<description><![CDATA[<p>some text</p>]]></description>
the serialized result looks like this, which is not well-formed:
<description><![CDATA<p>some text</p>]></description>
My code is below. It runs inside a Lotus Domino agent. How can I fix this?
import java.io.InputStream;
import java.io.PrintWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLDecoder;
import java.util.HashMap;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;
import lotus.domino.*;

public class JavaAgent extends AgentBase {
    public void NotesMain() {
        try {
            org.w3c.dom.Document newDoc;
            DocumentBuilderFactory builderFactory;
            DocumentBuilder builder;
            Element docElem,tmpElem;
            Node tmpNode;
            Session session=getSession();
            AgentContext agentContext=session.getAgentContext();

            // Put URL arguments into a HashMap.
            Document doc=agentContext.getDocumentContext();
            String[] query=doc.getItemValueString("Query_String").split("&");
            HashMap<String,String> queryMap=new HashMap<String,String>(query.length);
            for (int i=0; i<query.length; i++) {
                int j=query[i].indexOf('=');
                if (j<0) queryMap.put(query[i],"");
                else queryMap.put(query[i].substring(0,j),URLDecoder.decode(query[i].substring(j+1),"UTF-8"));
            }

            // Get the "src" URL argument - this is the URL we're reading the feed from.
            String urlStr=queryMap.get("src");
            if (urlStr==null || urlStr.length()==0) {
                System.err.println("Error: source URL not specified.");
                return;
            }
            URL url;
            try {
                url=new URL(urlStr);
            } catch (Exception e) {
                System.err.println("Error: invalid source URL.");
                return;
            }
            HttpURLConnection conn=(HttpURLConnection)url.openConnection();
            InputStream is=conn.getInputStream();

            // Create a DocumentBuilder and parse the XML.
            builderFactory=DocumentBuilderFactory.newInstance();
            builder=builderFactory.newDocumentBuilder();
            try {
                newDoc=builder.parse(is);
                is.close();
                conn.disconnect();
            } catch (Exception e) {
                is.close();
                conn.disconnect();
                System.err.println("XML parse exception: "+e.toString());
                return;
            }

            // Modify the document: declare the ibmwcm namespace on the root element.
            docElem=newDoc.getDocumentElement();
            docElem.setAttribute("xmlns:ibmwcm","http://purl.org/net/ibmfeedsvc/wcm/1.0");

            // Serialize the DOM tree to the agent's output.
            PrintWriter pw=getAgentOutput();
            pw.println("Content-type: text/xml");
            DOMImplementationRegistry registry=DOMImplementationRegistry
                .newInstance();
            DOMImplementationLS impl=(DOMImplementationLS)registry
                .getDOMImplementation("LS");
            LSOutput lso=impl.createLSOutput();
            lso.setCharacterStream(pw);
            LSSerializer writer=impl.createLSSerializer();
            writer.write(newDoc,lso);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Answer 0 (score: 0)
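I have determined that the problem has nothing to do with the serializer.
Even when I do something as simple as:
pw.print("<description><![CDATA[<p>Some text</p>]]></description>");
the inner square brackets are still stripped.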
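One way to double-check that the serializer is not at fault is to run the same parse and serialize steps outside Domino and write to a StringWriter instead of the agent output. The following is only a minimal sketch under that assumption; the class name CDataRoundTrip and the method roundTrip are made up for illustration, and it relies on whatever DOM Load and Save implementation the JDK provides by default.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original agent.
public class CDataRoundTrip {

    // Parses the XML string and serializes it straight back into a String,
    // so no Domino output stream is involved.
    static String roundTrip(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");

        StringWriter sw = new StringWriter();
        LSOutput lso = impl.createLSOutput();
        lso.setCharacterStream(sw);

        LSSerializer writer = impl.createLSSerializer();
        writer.write(doc, lso);
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        String out = roundTrip("<description><![CDATA[<p>some text</p>]]></description>");
        System.out.println(out);
        // Reports whether the CDATA section survived the round trip unchanged.
        System.out.println("CDATA intact: " + out.contains("<![CDATA[<p>some text</p>]]>"));
    }
}
```

If the CDATA section comes back intact here but not in the agent's output, that points at the output path rather than at LSSerializer.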
However, if I encode the less-than signs inside the CDATA section, the problem disappears. E.g.:
pw.print("<description><![CDATA[&lt;p>Some text&lt;/p>]]></description>");
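I haven't yet determined whether the cause is the PrintWriter class or Lotus Domino, but either way I should be able to fix it by modifying the XML between parsing and serializing.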
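A possible way to do that, sketched here as an assumption rather than a tested Domino fix, is to avoid emitting CDATA sections at all. Calling DocumentBuilderFactory.setCoalescing(true) makes the parser turn CDATA sections into ordinary text nodes, so the serializer writes the markup characters as entity references and no "<![CDATA[" sequence reaches the agent output. This is a slightly different technique from editing the tree after parsing, and whether Domino leaves the escaped form alone has not been verified. The class name CoalescingWorkaround and the method serializeWithoutCData are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original agent.
public class CoalescingWorkaround {

    // Parses with coalescing enabled, so CDATA sections become plain text nodes,
    // then serializes the document into a String.
    static String serializeWithoutCData(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setCoalescing(true); // convert CDATA nodes to Text nodes while parsing
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");

        StringWriter sw = new StringWriter();
        LSOutput lso = impl.createLSOutput();
        lso.setCharacterStream(sw);

        LSSerializer writer = impl.createLSSerializer();
        writer.write(doc, lso);
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        String out = serializeWithoutCData(
            "<description><![CDATA[<p>some text</p>]]></description>");
        System.out.println(out);
        // Reports whether any CDATA section is left in the serialized output.
        System.out.println("Contains CDATA: " + out.contains("<![CDATA["));
    }
}
```

In the agent, the same effect would come from calling builderFactory.setCoalescing(true) before newDocumentBuilder(). A feed reader that parses the result sees the same character data either way, because escaped text and a CDATA section are two spellings of the same content.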