Question

I am trying to parse a document from bytes, as shown below:

String result = /* some valid xml document */;
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = factory.newDocumentBuilder();
try {
    Document document = parser.parse(new ByteArrayInputStream(result.getBytes()));
} catch (MalformedByteSequenceException e) {
    System.out.println("(MalformedByteSequenceException ) " + e.getMessage());
}

A MalformedByteSequenceException is thrown, and the following is printed to the console:

"(MalformedByteSequenceException ) Invalid byte 2 of 4-byte UTF-8 sequence."

Strangely, the same code works in my local environment (Windows 10) but fails in the remote environment (Windows Server 2012).

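I tried to reproduce the error locally: I changed the TomEE version from 1.7.4 to 1.7.1, changed the JRE from 1.7.0_80 to 1.7.0, and copied the complete TomEE folder from the remote machine to my local machine. The error still occurs only in the remote environment.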

Using result.getBytes(Charset.forName("UTF-8")) instead of result.getBytes() did not work either.

Answer 1

I found the solution. Set this at the beginning of setenv.bat:

rem Set encoding
set JAVA_OPTS=%JAVA_OPTS% -Dfile.encoding=UTF-8

I am not sure of the rationale behind this, but it seems the JVM was using some odd Windows encoding instead of the UTF-8 you need.

Answer 2

Calling String.getBytes() is exactly the same as calling String.getBytes("<value of file.encoding>").
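To see which encoding a given JVM actually picks, a small check like the following may help. It is only a sketch: the class name DefaultCharsetCheck and its helper methods are hypothetical, and it assumes a Java 7 or later runtime such as the one in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetCheck {

    // Hypothetical helper: true when getBytes() yields the same bytes as
    // encoding explicitly with the JVM's default charset.
    static boolean usesDefaultCharset(String text) {
        return Arrays.equals(text.getBytes(), text.getBytes(Charset.defaultCharset()));
    }

    // Hypothetical helper: true when the default charset happens to encode
    // this text exactly as UTF-8 would.
    static boolean defaultMatchesUtf8(String text) {
        return Arrays.equals(text.getBytes(), text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String sample = "h\u00e9llo"; // contains one non-ASCII character
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("getBytes() uses default charset: " + usesDefaultCharset(sample));
        System.out.println("default encodes sample as UTF-8: " + defaultMatchesUtf8(sample));
    }
}
```

Running this on both the local and the remote machine should show whether the two JVMs start with different default charsets.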

However, there is no need to call it at all. Call parse with an InputSource that wraps a StringReader.
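A minimal sketch of that suggestion follows. The class name ParseFromString and the helper parseXml are hypothetical, and the sample XML is made up for illustration. Because the parser reads characters directly from the reader, no byte encoding step is involved.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ParseFromString {

    // Hypothetical helper: parses XML held in a String via a character
    // stream, so String.getBytes() and the default charset are never used.
    static Document parseXml(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder parser = factory.newDocumentBuilder();
            return parser.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample document with non-ASCII content, including a
        // character that would need four bytes in UTF-8.
        String result = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<greeting>h\u00e9llo \uD83D\uDE00</greeting>";
        Document document = parseXml(result);
        System.out.println(document.getDocumentElement().getTagName()); // prints: greeting
    }
}
```

With a character stream, the parser disregards the encoding named in the XML declaration, so a mismatch between that declaration and the platform default cannot corrupt the input at this step.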

Invalid byte 2 of 4-byte UTF-8 sequence during document parsing
