Question

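I'm having trouble parsing Arabic letters with a DOM parser: I get strange characters. I tried switching to a different encoding, but I couldn't get it to work.

The full code is at this link: http://test11.host56.com/parser.java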

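public Document getDomElement(String xml) {
    Document doc = null;
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    try {
        // First attempt: re-encode the string as UTF-8 bytes and read them back
        Reader reader = new InputStreamReader(new ByteArrayInputStream(
                xml.getBytes("UTF-8")));
        InputSource is = new InputSource(reader);

        DocumentBuilder db = dbf.newDocumentBuilder();

        //InputSource is = new InputSource();
        // Second attempt: this replaces the reader set above
        is.setCharacterStream(new StringReader(xml));
        doc = db.parse(is);

        return doc;
    } catch (ParserConfigurationException | SAXException | IOException e) {
        // Parsing failed: report the error and return null
        e.printStackTrace();
        return null;
    }
}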

My XML file:

<?xml version="1.0" encoding="UTF-8"?>
<music>
<song>
    <id>1</id>    
    <title>اهلا وسهلا</title>
    <artist>بكم</artist>
    <duration>4:47</duration>
    <thumb_url>http://wtever.png</thumb_url>
</song>
</music>

Answer 1

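You already have the XML as a String, so unless that string already contains odd characters (that is, it was already read in with the wrong encoding), you can avoid the encoding madness by using a StringReader. For example, instead of: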

Reader reader = new InputStreamReader(new ByteArrayInputStream(
   xml.getBytes("UTF-8")));

use:

Reader reader = new StringReader(xml);
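
As a rough, self-contained sketch of that suggestion (the class name StringReaderParseSketch and the helper parse are made up for illustration, and the sample XML is a shortened version of the file in the question), parsing from a StringReader could look like this:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class name, used only for this illustration.
public class StringReaderParseSketch {

    // Parses XML that is already held in a String. A StringReader hands the
    // parser characters, so no byte-to-character decoding happens here.
    public static Document parse(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<music><song><title>اهلا وسهلا</title><artist>بكم</artist></song></music>";
        Document doc = parse(xml);
        // Prints the title text exactly as it was stored in the String.
        System.out.println(doc.getElementsByTagName("title").item(0).getTextContent());
    }
}
```

This only helps if the String itself is intact. If the text was already decoded with the wrong charset before it reached this method, the parser will faithfully return the garbled characters.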

Edit: Now that I've seen more of the code, it seems the encoding problem occurs before the XML is parsed, because that part contains:

HttpResponse httpResponse = httpClient.execute(httpPost);
HttpEntity httpEntity = httpResponse.getEntity();
xml = EntityUtils.toString(httpEntity);

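The javadoc for EntityUtils.toString says:

The content is converted using the character set from the entity (if any), failing that, "ISO-8859-1" is used.

It seems the server does not send correct encoding information with the entity, so EntityUtils falls back to its default, which is not UTF-8.

Fix: use the variant that takes an explicit default encoding: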

xml = EntityUtils.toString(httpEntity, "utf-8");

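Here I'm assuming the server sends UTF-8. If the server uses a different encoding, set that encoding instead of UTF-8. (But since the XML also declares encoding="UTF-8", I think that is the case.) If the encoding the server uses is unknown, then you can only make wild guesses and hope to get lucky, sorry.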
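
To see why the ISO-8859-1 fallback produces strange characters, here is a small sketch that uses only the JDK rather than HttpClient. The class name CharsetFallbackSketch and the helper decode are hypothetical, and the byte array simply stands in for a response body that the server encoded as UTF-8:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this illustration.
public class CharsetFallbackSketch {

    // Turns raw response bytes into text using the given charset.
    public static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        String original = "بكم";
        // Stand-in for the bytes a UTF-8 server would put on the wire.
        byte[] body = original.getBytes(StandardCharsets.UTF_8);

        // What happens when the decoder falls back to ISO-8859-1.
        String garbled = decode(body, StandardCharsets.ISO_8859_1);
        // What happens when UTF-8 is given explicitly.
        String correct = decode(body, StandardCharsets.UTF_8);

        System.out.println(original.equals(garbled)); // false
        System.out.println(original.equals(correct)); // true
    }
}
```

Each Arabic letter here takes two bytes in UTF-8, and ISO-8859-1 turns every byte into its own character, so the three-letter word comes back as six unrelated characters.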

Answer 2

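If the XML contains Unicode characters (such as Arabic or Persian letters), StringReader can cause an exception. In those cases, pass the InputStream directly to the parser that builds the Document object.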
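
A minimal sketch of that approach, assuming the raw bytes are available as an InputStream (the class name InputStreamParseSketch and the helper parse are hypothetical, and a ByteArrayInputStream stands in for the real network or file stream):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical class name, used only for this illustration.
public class InputStreamParseSketch {

    // Gives the parser the raw bytes, so it determines the encoding itself
    // from the byte order mark or the XML declaration.
    public static Document parse(InputStream in) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(in);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<music><song><title>اهلا وسهلا</title></song></music>";
        // Stand-in for bytes read from the network or from a file.
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        Document doc = parse(in);
        System.out.println(doc.getElementsByTagName("title").item(0).getTextContent());
    }
}
```

The point of this design is that the bytes are never converted to a String before parsing, so there is no intermediate step where a wrong default charset could be applied.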

DOM parser with Arabic