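I am trying to parse and edit UTF-8 encoded XML files, but some characters come back as their HTML numeric codes instead of the characters themselves.
To troubleshoot this, I set up a DOM parser that essentially copies the XML without editing it. I am working specifically with Japanese kanji/Chinese characters, and some of them are parsed and returned as their HTML codes. I have tried specifying UTF-8 as the encoding on the input stream, the transformer, and the output stream, but the result is the same. I took this code from https://www.journaldev.com/901/modify-xml-file-in-java-dom-parser.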
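import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// The class name is illustrative; the original snippet shows only the method body.
public class CopyXml {
    public static void main(String[] args) {
        String filePath = "file path";
        File xmlFile = new File(filePath);
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        try {
            // Parse the input file into a DOM tree.
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document doc = dBuilder.parse(xmlFile);
            doc.getDocumentElement().normalize();
            // Serialize the unchanged DOM tree to updated.xml as UTF-8.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.transform(new DOMSource(doc), new StreamResult(new File("updated.xml")));
            System.out.println("XML file updated successfully");
        } catch (SAXException | ParserConfigurationException | IOException | TransformerException e1) {
            e1.printStackTrace();
        }
    }
}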
This is what the XML looks like before parsing, and it should look the same when it comes back:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Entry for Kanji: -->
<character>
<literal>𠮟</literal>
</character>
This is what comes back instead:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Entry for Kanji: -->
<character>
<literal>&#134047;</literal>
</character>
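Answer 0 (score: 0)
The core problem seems to be that Transformer.transform() only supports "clean" conversion of characters in the Basic Multilingual Plane (BMP), although there may be more to the story than that. I cloned the code from your link and created an input XML file containing a sample of CJK characters: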
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<character>
<!-- Basic Multilingual Plane -->
<!-- CJK Unified Ideographs Extension A -->
<literal>U+3400 㐀</literal>
<literal>U+4DB5 䶵</literal>
<!-- CJK Unified Ideographs Extension -->
<literal>U+53F1 叱</literal>
<!-- Supplementary Ideographic Plane -->
<!-- CJK Unified Ideographs Extension B -->
<literal>U+20000 𠀀</literal>
<literal>U+20B9F 𠮟</literal>
<literal>U+2A6D6 𪛖</literal>
<!-- CJK Unified Ideographs Extension C -->
<literal>U+2A700 𪜀</literal>
<literal>U+2B734 𫜴</literal>
<!-- CJK Unified Ideographs Extension D -->
<literal>U+2B740 𫝀</literal>
<literal>U+2B81D 𫠝</literal>
</character>
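When I ran the application (using JDK 11), the three CJK characters in the BMP were transformed correctly, but every CJK character in the Supplementary Ideographic Plane (SIP) was converted to an HTML escape code (a numeric character reference). This is the XML file that was created: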
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<character>
<!-- Basic Multilingual Plane -->
<!-- CJK Unified Ideographs Extension A -->
<literal>U+3400 㐀</literal>
<literal>U+4DB5 䶵</literal>
<!-- CJK Unified Ideographs Extension -->
<literal>U+53F1 叱</literal>
<!-- Supplementary Ideographic Plane -->
<!-- CJK Unified Ideographs Extension B -->
<literal>U+20000 &#131072;</literal>
<literal>U+20B9F &#134047;</literal>
<literal>U+2A6D6 &#173782;</literal>
<!-- CJK Unified Ideographs Extension C -->
<literal>U+2A700 &#173824;</literal>
<literal>U+2B734 &#177972;</literal>
<!-- CJK Unified Ideographs Extension D -->
<literal>U+2B740 &#177984;</literal>
<literal>U+2B81D &#178205;</literal>
</character>
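When I ran the code in a debugger, the JRE appeared to be using Xalan to implement Transformer.transform(). There is a very old SO post, "Serializing supplementary unicode characters into XML documents with Java", which is not identical to your question but is related. The poster even raised a Xalan bug report in 2012 for the issue "ToXMLStream does not support unicode supplementary characters", and that report is still open!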
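The character you mentioned in the comments (U+20B9F) is in the SIP, which is presumably why it is converted to an escape code, whereas the very similar character 叱 (U+53F1) is in the BMP and is transformed correctly.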
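To make the BMP/SIP distinction concrete, here is a minimal sketch that inspects the two code points using only standard java.lang.Character methods. The class and method names (CodePointInfo, describe) are made up for illustration and do not come from the question or from Xalan.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class CodePointInfo {
    /** Summarizes the plane, UTF-16 length, UTF-8 length and decimal value of one code point. */
    static String describe(int codePoint) {
        String text = new String(Character.toChars(codePoint));
        String plane = Character.isSupplementaryCodePoint(codePoint) ? "supplementary" : "BMP";
        return "U+" + Integer.toHexString(codePoint).toUpperCase()
                + " plane=" + plane
                + " utf16Units=" + text.length()
                + " utf8Bytes=" + text.getBytes(StandardCharsets.UTF_8).length
                + " decimal=" + codePoint;
    }

    public static void main(String[] args) {
        System.out.println(describe(0x53F1));  // the BMP character
        System.out.println(describe(0x20B9F)); // the SIP character
    }
}
```

The decimal value reported for U+20B9F is 134047, the same number that appears in the escape code for that character, and it is the only one of the two that needs a surrogate pair in UTF-16 and four bytes in UTF-8.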
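I don't know why this problem exists, but there are a couple of possible causes:
The Transformer.transform() implementation only supports characters in the BMP.
The Transformer.transform() implementation does not support conversion of four-byte Unicode characters.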
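Neither cause is confirmed here. As a stopgap, one option might be to post-process the serialized text and turn numeric references for supplementary code points back into literal characters. The following is only a sketch under that assumption: SupplementaryRefDecoder and decodeSupplementary are hypothetical names, not part of the JDK or Xalan, and the method treats its input as plain text, so it would also rewrite matching sequences inside comments or CDATA sections.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper, for illustration only.
final class SupplementaryRefDecoder {
    // Matches decimal (&#134047;) and hexadecimal (&#x20B9F;) numeric character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:(\\d+)|[xX]([0-9a-fA-F]+));");

    /** Replaces references to supplementary code points with the characters themselves. */
    static String decodeSupplementary(String xml) {
        Matcher m = NUMERIC_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(); // default: keep the reference unchanged
            try {
                int cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1))
                        : Integer.parseInt(m.group(2), 16);
                // BMP references such as &#60; stay escaped so the markup is not broken.
                if (Character.isSupplementaryCodePoint(cp)) {
                    replacement = new String(Character.toChars(cp));
                }
            } catch (NumberFormatException e) {
                // Value does not fit in an int, so it is not a valid code point; keep it.
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeSupplementary("<literal>U+20B9F &#134047;</literal>"));
    }
}
```

To use it with the code from the question, the transformer could write to a StringWriter instead of a File, and the decoded string could then be written to disk with an explicit UTF-8 encoding.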