I have an XML document that begins like this:
<?xml version="1.0"?>
<!DOCTYPE viewdef [
<!ENTITY nbsp " "> <!-- no-break space = non-breaking space U+00A0 ISOnum -->
<!ENTITY copy "©"> <!-- copyright sign, U+00A9 ISOnum -->
<!ENTITY amp "&"> <!-- ampersand -->
<!ENTITY shy "­"> <!-- soft hyphen -->
]>
I am parsing the document with Jsoup 1.8.2 as follows:
public static void convertXml(String inFile, String outFile) throws Exception {
    String xmlString = FileUtils.readFileToString(new File(inFile), Charset.forName("UTF-8"));
    Document document = Jsoup.parse(xmlString, "UTF-8", Parser.xmlParser());
    FileUtils.writeStringToFile(new File(outFile), document.html(), "UTF-8");
}
I expected the output file to be identical to the input in this case, but Jsoup produces this instead:
<?xml version="1.0"?> <!DOCTYPE viewdef>
<!-- no-break space = non-breaking space U+00A0 ISOnum -->
<!--ENTITY copy "©"-->
<!-- copyright sign, U+00A9 ISOnum -->
<!--ENTITY amp "&"-->
<!-- ampersand -->
<!--ENTITY shy "­"-->
<!-- soft hyphen --> ]>
Is this a bug, or is there a way to preserve the original DOCTYPE declaration?
Answer 0 (score: 0)
Before parsing xmlString with Jsoup, manually replace the DOCTYPE sequence with a placeholder, then put it back into the final document.
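import java.io.File;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

private final static String DOCTYPE_SEQUENCE = "<doctype-sequence/>";
// Backslashes are doubled for the Java string literal; the reluctant +? stops at the first "]>"
private final static Pattern pattern = Pattern.compile("(?i)<!DOCTYPE[\\s\\S]+?]>");
public static void convertXml(String inFile, String outFile) throws Exception {
    String xmlString = FileUtils.readFileToString(new File(inFile), Charset.forName("UTF-8"));
    // Swap the DOCTYPE declaration, if present, for a placeholder element
    String doctype = "";
    Matcher matcher = pattern.matcher(xmlString);
    if (matcher.find()) {
        doctype = matcher.group(0);
        xmlString = xmlString.replace(doctype, DOCTYPE_SEQUENCE);
    }
    Document document = Jsoup.parse(xmlString, "UTF-8", Parser.xmlParser());
    // Put the original DOCTYPE declaration back in place of the placeholder
    FileUtils.writeStringToFile(new File(outFile), document.html().replace(DOCTYPE_SEQUENCE, doctype), "UTF-8");
}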
The pattern variable is declared outside convertXml so that the regular expression is not recompiled on every call.
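The swap-and-restore step can also be tried on plain strings, without Jsoup on the classpath. The sketch below is only an illustration: the class DoctypePlaceholder and its methods extract, strip and restore are hypothetical names, not part of Jsoup or of the answer above. It also assumes that a serializer might write the empty placeholder element back as <doctype-sequence />, or as an open/close pair, rather than exactly <doctype-sequence/>, so the restore step matches any of those forms. That assumption is worth checking against the Jsoup version in use.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Jsoup API): swaps the DOCTYPE for a placeholder and back.
final class DoctypePlaceholder {
    static final String PLACEHOLDER = "<doctype-sequence/>";

    // Reluctant match from "<!DOCTYPE" to the first "]>", i.e. a DOCTYPE with an internal subset.
    private static final Pattern DOCTYPE =
            Pattern.compile("(?i)<!DOCTYPE[\\s\\S]+?]>");

    // Assumption: the serializer may emit the empty element in any of these spellings.
    private static final Pattern SERIALIZED_PLACEHOLDER =
            Pattern.compile("<doctype-sequence\\s*(?:/>|>\\s*</doctype-sequence\\s*>)");

    /** Returns the DOCTYPE declaration, or "" when none is found. */
    static String extract(String xml) {
        Matcher m = DOCTYPE.matcher(xml);
        return m.find() ? m.group() : "";
    }

    /** Replaces the first DOCTYPE declaration with the placeholder element. */
    static String strip(String xml) {
        return DOCTYPE.matcher(xml).replaceFirst(Matcher.quoteReplacement(PLACEHOLDER));
    }

    /** Writes the saved DOCTYPE back where the placeholder element was serialized. */
    static String restore(String serialized, String doctype) {
        // quoteReplacement keeps "$" and "\" in the DOCTYPE from being read as group references.
        return SERIALIZED_PLACEHOLDER.matcher(serialized)
                .replaceFirst(Matcher.quoteReplacement(doctype));
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE viewdef [\n"
                + "<!ENTITY copy \"&#169;\"> <!-- copyright sign -->\n"
                + "]>\n"
                + "<viewdef/>";
        String doctype = extract(xml);
        String stripped = strip(xml);
        // A real run would parse and serialize `stripped` with Jsoup at this point.
        System.out.println(restore(stripped, doctype).equals(xml));
    }
}
```

In convertXml, restore(document.html(), doctype) would take the place of the plain String.replace call if the serialized placeholder turns out to differ from the literal that was inserted.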