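I need to convert the output to UTF-8, because special characters (accented letters, for example) come out garbled.
Does anyone know how to do this?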
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
URL u = new URL("http://www.aredacao.com.br/tv-saude");
Document doc = builder.parse(u.openStream());
NodeList nodes = doc.getElementsByTagName("item");
Answer 0 (score: 0)
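The problem is that the site returns <?xml version='1.0' encoding='iso-8859-1'?>
but it should return <?xml version='1.0' encoding='UTF-8'?>.
The content is actually UTF-8, so a parser that trusts the declaration decodes each byte of a multi-byte character as a separate ISO-8859-1 character.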
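To make the symptom concrete, here is a minimal sketch of what happens when UTF-8 bytes are decoded as ISO-8859-1. The class name `MojibakeDemo`, the helper `misdecode`, and the sample word are made up for illustration and are not taken from the feed.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1,
    // which mimics a parser trusting a wrong encoding declaration.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical sample: "saude" with u-acute (U+00FA), written as an
        // escape so the result does not depend on the source file encoding.
        String original = "sa\u00fade";
        String garbled = misdecode(original);

        // U+00FA is two bytes in UTF-8 (0xC3 0xBA), so the garbled string
        // holds two characters (U+00C3, U+00BA) where the original had one.
        System.out.println("original length = " + original.length());
        System.out.println("garbled length  = " + garbled.length());
    }
}
```

Because ISO-8859-1 maps every byte value to exactly one character, this wrong decoding loses no information, which is why it can be undone afterwards.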
One solution is to re-decode the text of each element yourself:
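import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

static void readData()
        throws IOException,
               ParserConfigurationException,
               SAXException {
    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    URL u = new URL("http://www.aredacao.com.br/tv-saude");
    Document doc = builder.parse(u.toString());
    NodeList nodes = doc.getElementsByTagName("item");

    for (int i = 0; i < nodes.getLength(); i++) {
        Element el = (Element) nodes.item(i);

        // The parser decoded the text as ISO-8859-1; repair each string.
        String title =
            el.getElementsByTagName("title").item(0).getTextContent();
        title = treatCharsAsUtf8Bytes(title);

        String description =
            el.getElementsByTagName("description").item(0).getTextContent();
        description = treatCharsAsUtf8Bytes(description);

        System.out.println("title=" + title);
        System.out.println("description=" + description);
        System.out.println();
    }
}

// Recovers the original bytes (one per ISO-8859-1 char) and decodes them as UTF-8.
private static String treatCharsAsUtf8Bytes(String s) {
    byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
    return new String(bytes, StandardCharsets.UTF_8);
}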
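Another possibility is to write a subclass of FilterInputStream that replaces the incorrect encoding in the <?xml
prolog, but that is a lot more work. I would only consider it if the document has a complex structure with many different elements, where re-decoding each element individually would be clumsy.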
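For a document-wide fix that does not touch each element, a different technique from the FilterInputStream idea may be worth trying: hand the parser a character stream instead of a byte stream. The SAX InputSource documentation states that when a character stream is supplied, the parser reads it directly and disregards the encoding declaration in the document. The sketch below is untested against the actual feed; the class name `ReaderParseSketch`, the helpers `parseAsUtf8` and `sampleTitle`, and the inline sample XML are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ReaderParseSketch {
    // Decodes the bytes as UTF-8 ourselves and gives the parser characters,
    // so the encoding named in the XML prolog is not used for decoding.
    static Document parseAsUtf8(InputStream in) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        return builder.parse(new InputSource(reader));
    }

    // Hypothetical stand-in for the feed: the prolog claims ISO-8859-1,
    // but the bytes are really UTF-8.
    static String sampleTitle() throws Exception {
        String xml = "<?xml version='1.0' encoding='iso-8859-1'?>"
            + "<rss><item><title>Sa\u00fade</title></item></rss>";
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        Document doc = parseAsUtf8(new ByteArrayInputStream(utf8));
        return doc.getElementsByTagName("title").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("title=" + sampleTitle());
    }
}
```

With the real feed, the equivalent call would be `parseAsUtf8(u.openStream())`. This only makes sense as long as the server keeps sending UTF-8 bytes under the wrong declaration; if the site later fixes its prolog, the plain `builder.parse(...)` call would be correct again.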