Question

I am using JAXP to generate and parse an XML document in which some fields are loaded from a database.

Code that serializes the XML:

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("test");
root.setAttribute("version", text);
doc.appendChild(root);

DOMSource domSource = new DOMSource(doc);
TransformerFactory tFactory = TransformerFactory.newInstance();

FileWriter out = new FileWriter("test.xml");
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(domSource, new StreamResult(out));

Code that parses the XML:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse("test.xml");

I get the following exception:

[Fatal Error] test.xml:1:4: Invalid byte 1 of 1-byte UTF-8 sequence.
Exception in thread "main" org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
    at com.test.Test.xml(Test.java:27)
    at com.test.Test.main(Test.java:55)

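The String text contains u-umlaut and o-umlaut (character codes 0xFC and 0xF6). These are the characters that cause the error. When I escape the String myself, using &#xFC; and &#xF6;, the problem goes away. Other entities are encoded automatically when I write out the XML.

How can I write and read the output correctly without replacing these characters myself?

(I have already read the following questions:

How to encode characters from Oracle to XML?

Repairing wrong encoding in XML files)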

Answer 1

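Use a FileOutputStream rather than a FileWriter.

The latter applies its own encoding, which is almost certainly not UTF-8 (depending on your platform, it is probably Windows-1252 or ISO-8859-1).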
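A minimal sketch of that change, reusing the document structure from the question. The class name Utf8XmlRoundTrip, the helper methods, and the temporary file are assumptions made for illustration. Because the Transformer is handed an OutputStream, it encodes the characters itself according to OutputKeys.ENCODING:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class Utf8XmlRoundTrip {

    // Builds <test version="..."/> and writes it to the file as bytes.
    static void write(File file, String text) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement("test");
        root.setAttribute("version", text);
        doc.appendChild(root);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        // An OutputStream carries bytes, so the transformer (not a Writer
        // with the platform default charset) decides how characters are encoded.
        try (OutputStream out = new FileOutputStream(file)) {
            transformer.transform(new DOMSource(doc), new StreamResult(out));
        }
    }

    // Parses the file and returns the "version" attribute of the root element.
    static String readVersion(File file) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(file);
        return doc.getDocumentElement().getAttribute("version");
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("test", ".xml");
        file.deleteOnExit();
        String text = "\u00FC\u00F6"; // u-umlaut and o-umlaut
        write(file, text);
        System.out.println("round trip ok: " + text.equals(readVersion(file)));
    }
}
```

If a Writer is required for some reason, wrapping the stream as new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8) should have the same effect, since the charset is then stated explicitly.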

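Edit (now that I have time):

An XML document without a prolog may be encoded as either UTF-8 or UTF-16. With a prolog, it may declare its encoding (the prolog can contain only US-ASCII characters, so the prolog is always readable).

A Reader deals with characters; it decodes the byte stream of the underlying InputStream. So when you pass a Reader to the parser, you are telling it that you have already handled the encoding, and the parser ignores the encoding declared in the prolog. When you pass an InputStream (which reads bytes), the parser makes no such assumption and looks at the prolog to determine the encoding, defaulting to UTF-8 / UTF-16 if there is none.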
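The difference can be sketched as follows. The class name ReaderVersusStream and the sample document are assumptions for illustration. The same document is parsed once from bytes encoded as ISO-8859-1, where the parser has to rely on the prolog, and once from a Reader, where the characters are already decoded and the declared encoding plays no part:

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ReaderVersusStream {

    // Sample document (hypothetical) whose prolog declares ISO-8859-1.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><test version=\"\u00FC\u00F6\"/>";

    static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // Byte input: the parser reads the prolog to learn how to decode the bytes.
    static String fromStream() throws Exception {
        byte[] bytes = XML.getBytes(StandardCharsets.ISO_8859_1);
        Document doc = newBuilder().parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getAttribute("version");
    }

    // Character input: decoding has already happened, so the parser
    // consumes the characters as given.
    static String fromReader() throws Exception {
        Document doc = newBuilder().parse(new InputSource(new StringReader(XML)));
        return doc.getDocumentElement().getAttribute("version");
    }

    public static void main(String[] args) throws Exception {
        String expected = "\u00FC\u00F6";
        System.out.println("stream ok: " + expected.equals(fromStream()));
        System.out.println("reader ok: " + expected.equals(fromReader()));
    }
}
```

The corollary is that a Reader built with the wrong charset hands the parser wrong characters, and the prolog cannot correct that.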

I have never tried this with a file encoded in UTF-16. I suspect the parser looks for a byte order mark (BOM) in the first 2 bytes of the file.

Answer 2

Well, 0xFC and 0xF6 are certainly not valid UTF-8 characters. They should be encoded as two-byte sequences: 0xC3BC and 0xC3B6.
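The byte values can be checked with a few lines of Java. The class name UmlautBytes and its hex helper are hypothetical names used only for this sketch. It prints the bytes produced for each character under UTF-8 and, for comparison, under ISO-8859-1:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UmlautBytes {

    // Returns the encoded bytes of s as upper-case hex, two digits per byte.
    static String hex(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("\u00FC", StandardCharsets.UTF_8));      // C3BC
        System.out.println(hex("\u00F6", StandardCharsets.UTF_8));      // C3B6
        System.out.println(hex("\u00FC", StandardCharsets.ISO_8859_1)); // FC
        System.out.println(hex("\u00F6", StandardCharsets.ISO_8859_1)); // F6
    }
}
```

A lone 0xFC or 0xF6 byte is therefore what a single-byte encoding such as ISO-8859-1 or Windows-1252 produces, which a UTF-8 decoder rejects.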

Most likely the problem is that the original source of the characters is declared to be UTF-8 when it is not.

Producing valid XML with Java and UTF-8 encoding
