Question

I'm using a SOAP API in my application. I have some boilerplate code that turns the API response into a Java object, like this:

// First I remove the SOAP wrapper:
String soapResponse = this.callApiEndpointByPage(someIncrementingInt);
ByteArrayInputStream inputStream = new ByteArrayInputStream(soapResponse.getBytes());
SOAPMessage message = MessageFactory.newInstance(SOAPConstants.SOAP_1_2_PROTOCOL).createMessage(null, inputStream);
message.setProperty("Content-Type" ,"text/xml; charset=utf-8"); 
Document doc = message.getSOAPBody().extractContentAsDocument();///<<--- Exception thrown here! 

// Then I create an unmarshaller:
JAXBContext context = JAXBContext.newInstance(MyPojo.class);
Unmarshaller um = context.createUnmarshaller();

// Then I unmarshal the XML to a POJO:
MyPojo myPojo = (MyPojo) um.unmarshal(doc);

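The API endpoint I'm hitting is paginated. For 99 out of 100 pages the code above works perfectly. However, when processing certain pages, this exception is thrown: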

Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence.

On closer inspection of the SOAP response, some of the data contained in the XML is itself escaped XML. It looks a bit like this:

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <soap:Body>
        <SomeXMLParentObject xmlns="http://url-endpoint.com/webservices/">
            <SomeXMLChildObject>
                &lt;?xml version="1.0" encoding="utf-16"?&gt;
                &lt;Records count="50000"&gt;
                &lt;someEscapedDataINeedLater&gt;
                tonnes of escaped XML here
                &lt;/someEscapedDataINeedLater&gt;
            </SomeXMLChildObject>
        </SomeXMLParentObject>
    </soap:Body>
</soap:Envelope>

Note that the response encoding is UTF-8, but the escaped XML it contains declares UTF-16. All pages have this, but not all pages throw the exception.

I suspect the software serving the API may allow some rarely used UTF-16 characters, and that these are causing the problem.
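One point that may help narrow this down: inside the outer document, the escaped `encoding="utf-16"` declaration is ordinary character data, so it should not influence how the outer UTF-8 bytes are decoded. The sketch below uses only the JDK's DOM parser (not SAAJ) and a made-up sample document; the `Child` and `r` element names are hypothetical. It pulls the escaped payload out as a string and parses it from a `StringReader`, where the parser works on characters and the payload's declared encoding is not used for decoding.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EscapedPayload {
    // Hypothetical stand-in for one page of the response: a UTF-8 document
    // whose text content is escaped XML that declares utf-16.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
        + "<Child>\n  &lt;?xml version=\"1.0\" encoding=\"utf-16\"?&gt;"
        + "&lt;Records count=\"2\"&gt;&lt;r&gt;caf\u00e9&lt;/r&gt;&lt;/Records&gt;\n</Child>";

    static DocumentBuilder newBuilder() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder();
    }

    // Parses the outer document from UTF-8 bytes and returns the unescaped
    // payload. trim() matters: an XML declaration must be the very first
    // thing in the inner document, with no leading whitespace.
    static String innerXml(String outerXml) throws Exception {
        byte[] bytes = outerXml.getBytes(StandardCharsets.UTF_8);
        Document outer = newBuilder().parse(new ByteArrayInputStream(bytes));
        return outer.getDocumentElement().getTextContent().trim();
    }

    // Parses the payload from characters rather than bytes.
    static Document parseInner(String innerXml) throws Exception {
        return newBuilder().parse(new InputSource(new StringReader(innerXml)));
    }

    public static void main(String[] args) throws Exception {
        String inner = innerXml(SAMPLE);
        Document records = parseInner(inner);
        System.out.println(records.getDocumentElement().getAttribute("count")); // 2
    }
}
```

If this holds for the real responses, the inner UTF-16 declaration is probably a red herring for the exception, and the bytes handed to the outer parser are the more likely place to look.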

However, I can't figure out how to force my code to expect UTF-16. No matter what I do, the error message says it is expecting a "3-byte UTF-8 sequence".

In the code above, I explicitly specify UTF-8:

message.setProperty("Content-Type" ,"text/xml; charset=utf-8");

However, changing it to UTF-16 has no effect. Inspecting the SOAPMessage shows that it is still expecting 'application/xml'.
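A possible cause worth ruling out: `soapResponse.getBytes()` with no argument encodes the string with the platform default charset, which before JDK 18 was often not UTF-8 (for example on Windows). If the default is a single-byte charset, any non-ASCII character becomes a byte sequence that is not valid UTF-8, while the XML declaration still tells the parser to decode UTF-8. The sketch below reproduces that mismatch with the JDK's DOM parser; the sample document is made up, and ISO-8859-1 stands in for whatever the default charset might be on the affected machine.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EncodingCheck {
    // Parses XML bytes; the parser decodes them according to the
    // encoding named in the XML declaration.
    static Document parse(byte[] xmlBytes) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmlBytes));
    }

    // Returns true if the bytes parse cleanly, false if the parser rejects them.
    static boolean parses(byte[] xmlBytes) {
        try {
            parse(xmlBytes);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: declares utf-8 and contains one non-ASCII character.
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><r>caf\u00e9</r>";

        // Bytes match the declaration: parses.
        System.out.println(parses(xml.getBytes(StandardCharsets.UTF_8)));      // true

        // Bytes are single-byte encoded but declared as utf-8: rejected.
        System.out.println(parses(xml.getBytes(StandardCharsets.ISO_8859_1))); // false
    }
}
```

If that is what is happening here, passing the charset explicitly, as in `soapResponse.getBytes(StandardCharsets.UTF_8)`, would keep the bytes consistent with the `encoding="utf-8"` declaration. It would also be worth checking that `callApiEndpointByPage` decodes the HTTP response as UTF-8 when it builds the string, since characters lost at that stage cannot be recovered later.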

Questions:

How can I make the code above expect UTF-16 instead of UTF-8?
Would that fix the exception I'm getting?

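Edit: The solution in the possible duplicate question seems to be to change how the XML is generated. That doesn't apply in my case, because I'm consuming an API and have no control over how the XML is formed.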

Edit 2:

I found that I can set the character encoding when getting the bytes from the string, like so:

            ByteArrayInputStream inputStream = new ByteArrayInputStream(soapResponse.getBytes(Charsets.UTF_16));

However, this leads me to a different problem:

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 39; Content is not allowed in prolog.

From some googling, this appears to be caused by a UTF-8 leading character:

"Another thing that often happens is a UTF-8 BOM (byte order mark), which is allowed before the XML declaration can be treated as whitespace if the document is handed as a stream of characters to an XML parser rather than as a stream of bytes."
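A small check of what `getBytes` actually produces may explain this second error. In Java, encoding with the `UTF_16` charset writes a big-endian byte order mark followed by two bytes per character, yet the XML declaration inside those bytes still says `encoding="utf-8"`. That mismatch would be consistent with `columnNumber: 39`, the position just past the 38-character declaration, though this is an inference rather than something confirmed against the real response. The sample string and the `hexPrefix` helper below are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf16Bytes {
    // Formats the first n bytes as space-separated hex pairs.
    static String hexPrefix(byte[] bytes, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n && i < bytes.length; i++) {
            sb.append(String.format("%02X ", bytes[i] & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><r/>";

        // UTF-8: the bytes start directly with "<?xm".
        System.out.println(hexPrefix(xml.getBytes(StandardCharsets.UTF_8), 4));  // 3C 3F 78 6D

        // UTF-16: a byte order mark (FE FF) first, then two bytes per character.
        System.out.println(hexPrefix(xml.getBytes(StandardCharsets.UTF_16), 4)); // FE FF 00 3C
    }
}
```

Either way, the bytes and the declared encoding need to agree; re-encoding the whole string as UTF-16 without changing the declaration breaks that agreement.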

This leads me to believe that simply changing the character encoding of the whole application to UTF-16 is not the solution.

Any ideas on how I can get this working for the pages that contain odd characters?

Parsing XML from an API and getting "MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence"

0 answers: