Jsoup - Parsing an HTML file with charset iso-8859-1

Date: 2014-02-23 21:26:20

Tags: java character-encoding html-parsing jsoup iso-8859-1

I'm having trouble with special characters when the charset is iso-8859-1. The same code works for UTF-8, so I don't understand what I'm doing wrong.

Here is the code:

File input = new File("/users/marcioapf/example.html");
Document doc = Jsoup.parse(input, "iso-8859-1", "");
Elements elements = doc.select("span.DEPUTADO");
System.out.println(elements.toString());

This is the output:

<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jo&atilde;ozinho Pereira</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Isnaldo Bulh&otilde;es</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Antonio Albuquerque</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jeferson Morais</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">In&aacute;cio Loiola</span> 

This is how it should look:

<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Joãozinho Pereira</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Isnaldo Bulhões</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Antonio Albuquerque</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jeferson Morais</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Inácio Loiola</span>

How can I fix this?

1 Answer:

Answer 0 (score: 1)

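Setting the escape mode to EscapeMode.xhtml gives you output without those entities: in that mode only the basic entities (lt, gt, amp, quot) are escaped, so the accented characters are printed as-is. Try this code: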

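  // Requires: import org.jsoup.nodes.Entities.EscapeMode;
  File input = new File("/users/marcioapf/example.html");
  Document doc = Jsoup.parse(input, "iso-8859-1", "");
  // Escape only lt, gt, amp and quot, so "ã" is printed literally instead of "&atilde;".
  doc.outputSettings().escapeMode(EscapeMode.xhtml);
  Elements elements = doc.select("span.DEPUTADO");
  System.out.println(elements.toString());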
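
As a rough sanity check of why this is safe for this file, here is a small JDK-only sketch (no jsoup involved). It assumes, as my understanding of jsoup rather than something stated above, that in xhtml mode a character is written literally when the document's output charset can encode it. The class name Latin1EncodableCheck and the method encodableInLatin1 are hypothetical names chosen for this illustration.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class Latin1EncodableCheck {

    // Returns true if every character in text has a direct ISO-8859-1 encoding.
    static boolean encodableInLatin1(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        // Names from the question, written with Unicode escapes so the
        // source compiles the same way regardless of the file encoding.
        String[] names = {
            "Jo\u00E3ozinho Pereira",  // a with tilde
            "Isnaldo Bulh\u00F5es",    // o with tilde
            "In\u00E1cio Loiola"       // a with acute
        };
        for (String name : names) {
            System.out.println(name + " -> " + encodableInLatin1(name));
        }
        // The euro sign is not part of ISO-8859-1, so this check fails for it.
        System.out.println("euro sign -> " + encodableInLatin1("\u20AC"));
    }
}
```

All three names from the question pass the check, so with an iso-8859-1 output charset there is no need to fall back to entities for them. A character outside that charset, such as the euro sign, would still have to be escaped in some form.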