UTF-8 with the standard openStream and DocumentBuilder

Time: 2014-08-31 13:52:14

Tags: java xml utf-8 rss

I need to convert the output to UTF-8, because special characters are not handled correctly in the output. Does anyone know how to do this?

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
URL u = new URL("http://www.aredacao.com.br/tv-saude");
Document doc = builder.parse(u.openStream());
NodeList nodes = doc.getElementsByTagName("item");

1 answer:

Answer 0 (score: 0)

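The problem is that the site returns <?xml version='1.0' encoding='iso-8859-1'?> when it should return <?xml version='1.0' encoding='UTF-8'?>. The content itself appears to be UTF-8, so the parser trusts the prolog and decodes the bytes with the wrong charset.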
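To see the effect without hitting the network, the following is a minimal sketch that parses an in-memory document whose bytes are UTF-8 but whose prolog claims iso-8859-1. The class name WrongPrologDemo, the helper names, and the sample title are assumptions made up for illustration; they are not taken from the feed in the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class WrongPrologDemo {

    // Hypothetical sample: UTF-8 bytes with a prolog that falsely claims iso-8859-1.
    static byte[] mislabeledXml() {
        String xml = "<?xml version='1.0' encoding='iso-8859-1'?>"
            + "<title>Sa\u00fade</title>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    // Parses from a byte stream, so the parser follows the prolog's encoding.
    static Document parse(byte[] bytes) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(mislabeledXml());
        String text = doc.getDocumentElement().getTextContent();
        System.out.println("declared encoding: " + doc.getXmlEncoding());
        // Each byte of the two-byte UTF-8 sequence became its own char,
        // so the parsed text is one char longer than the original.
        System.out.println("parsed length: " + text.length());
        System.out.println("matches original: " + text.equals("Sa\u00fade"));
    }
}
```

The parsed text contains two Latin-1 characters where the single accented character should be, which is the kind of garbling described in the question.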

One solution is to translate the text of each element yourself:

static void readData()
throws IOException,
       ParserConfigurationException,
       SAXException {

    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    URL u = new URL("http://www.aredacao.com.br/tv-saude");
    Document doc = builder.parse(u.toString());
    NodeList nodes = doc.getElementsByTagName("item");
    for (int i = 0; i < nodes.getLength(); i++) {
        Node node = nodes.item(i);
        Element el = (Element) node;

        String title =
            el.getElementsByTagName("title").item(0).getTextContent();
        title = treatCharsAsUtf8Bytes(title);

        String description =
            el.getElementsByTagName("description").item(0).getTextContent();
        description = treatCharsAsUtf8Bytes(description);

        System.out.println("title=" + title);
        System.out.println("description=" + description);
        System.out.println();
    }
}

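// Undoes the wrong decoding: ISO-8859-1 maps each char 0-255 to exactly one
// byte, so getBytes recovers the original bytes, which are then decoded as UTF-8.
private static String treatCharsAsUtf8Bytes(String s) {
    byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
    return new String(bytes, StandardCharsets.UTF_8);
}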
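The reason this repair is lossless can be checked with plain strings. The sketch below simulates the wrong decoding and then reverses it with the same logic as treatCharsAsUtf8Bytes above; the class name RoundTripDemo, the method names misdecode and repair, and the sample word are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {

    // Simulates the parser's mistake: UTF-8 bytes decoded as ISO-8859-1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Same logic as treatCharsAsUtf8Bytes: recover the bytes, decode as UTF-8.
    static String repair(String mangled) {
        byte[] bytes = mangled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Sa\u00fade";
        String mangled = misdecode(original);
        String repaired = repair(mangled);
        System.out.println("mangled differs: " + !mangled.equals(original));
        System.out.println("repaired equals original: " + repaired.equals(original));
    }
}
```

One caveat: the repair should only be applied to text that was actually decoded with the wrong charset. Running it on a string that was decoded correctly would corrupt any non-ASCII characters in it.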

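Another possibility is to write a subclass of FilterInputStream that replaces the incorrect encoding in the <?xml prolog, but that is considerably more work. I would only consider it if the document has a complex structure with many different elements, where translating each element individually would be unwieldy.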
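A lighter-weight way to get a similar effect, which is not the FilterInputStream approach described above but a different technique, is to hand the parser a character stream instead of a byte stream. According to the SAX InputSource documentation, when a character stream is supplied the parser reads it directly and disregards the encoding declaration in the document. The sketch below assumes the feed really is UTF-8; the class name ForceUtf8Parse and the method name parseAsUtf8 are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ForceUtf8Parse {

    // Decodes the stream as UTF-8 ourselves, so the prolog's encoding
    // declaration no longer decides how the bytes are interpreted.
    static Document parseAsUtf8(InputStream in) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        return builder.parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample standing in for the mislabeled feed.
        String xml = "<?xml version='1.0' encoding='iso-8859-1'?>"
            + "<rss><item><title>Sa\u00fade</title></item></rss>";
        byte[] utf8Bytes = xml.getBytes(StandardCharsets.UTF_8);

        Document doc = parseAsUtf8(new ByteArrayInputStream(utf8Bytes));
        String title =
            doc.getElementsByTagName("title").item(0).getTextContent();
        System.out.println("title intact: " + title.equals("Sa\u00fade"));
    }
}
```

For the code in the question, this would mean passing u.openStream() to parseAsUtf8 instead of calling builder.parse on it directly. No per-element translation is needed afterwards, but the approach would break if the site ever started serving genuine ISO-8859-1 content.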