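I am using the jsoup library to parse HTML, but I ended up with extra closing tags in my DOM, encoded as &lt; and &gt;. So I used a string utils library (StringEscapeUtils) to get rid of the encoding. The duplicate closing tags are still there, though; they are just no longer encoded. My initial HTML is: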
<!DOCTYPE html>
<html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class="SAF" id="global-header-light">
<head>
</head>
<body>
<div style="background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height: 2059px; width: 1001px; text-align: center; margin: 0 auto;">
<div style="height:2058px; padding-left:0px; padding-top:36px;">
<iframe style="height:90px; width:728px;" />
</div>
</div>
</body>
</html>
After running it through this code:
import org.apache.commons.lang.StringEscapeUtils; // Commons Lang 2.x, which provides unescapeHtml
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Inside the request handler:
String url = request.getParameter("htmluri").trim();
System.out.printf("Fetching %s...%n", url);
Document doc = Jsoup.connect(url).get();
Document.OutputSettings settings = doc.outputSettings();
settings.prettyPrint(false);
//settings.escapeMode(Entities.EscapeMode.base);
settings.charset("ASCII");
String html = doc.html();
html = StringEscapeUtils.unescapeHtml(html);
System.out.println(html);
I get this HTML:
<!DOCTYPE html><html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class="SAF" id="global-header-light"><head>
</head>
<body>
<div style="background-image: url(aol.jpeg);background-repeat: no-repeat;-webkit-background-size:100720; height:720; width:100; text-align: center; margin: 0 auto;">
<div style="height:100; padding-left:0px; padding-top:36px;">
<iframe style="height:90px; width:728px;"></iframe>
</div>
</div>
</body>
</html></div></div></body></html>
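So there are extra, duplicate closing div, body and html tags. I guess they won't hurt the rendering of the page, but is there a way to get rid of them?
Thanks, Swaraj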
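The effect described in the question can be reproduced without jsoup. The sketch below is only an illustration: the input string is made up, and unescapeAngleBrackets is a hypothetical, simplified stand-in for StringEscapeUtils.unescapeHtml that handles just &lt; and &gt;. It shows how unescaping a whole serialized document turns escaped text into a real closing tag.

```java
public class UnescapeDemo {
    // Hypothetical stand-in for StringEscapeUtils.unescapeHtml; handles only &lt; and &gt;.
    static String unescapeAngleBrackets(String s) {
        return s.replace("&lt;", "<").replace("&gt;", ">");
    }

    public static void main(String[] args) {
        // Made-up serialized document: the escaped </div> is plain text inside <body>.
        String serialized = "<html><body><div></div>&lt;/div&gt;</body></html>";
        // After unescaping, that text becomes a stray closing tag.
        System.out.println(unescapeAngleBrackets(serialized));
        // prints <html><body><div></div></div></body></html>
    }
}
```

Because the unescape runs over the entire string, it cannot tell markup from text that merely looks like markup, which would explain why the extra tags survive in unencoded form.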
Answer 0 (score: 1)
I'm back :P. You just need to parse the HTML you have once more, so that jsoup removes any extra closing tags:
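import org.apache.commons.lang.StringEscapeUtils; // Commons Lang 2.x, which provides unescapeHtml
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Inside the request handler:
String url = request.getParameter("htmluri").trim();
System.out.printf("Fetching %s...%n", url);
Document doc = Jsoup.connect(url).get();
Document.OutputSettings settings = doc.outputSettings();
settings.prettyPrint(false);
settings.charset("ASCII");
String html = doc.html();
// Unescaping turns encoded tags such as &lt;/div&gt; into real markup.
html = StringEscapeUtils.unescapeHtml(html);
// Re-parsing lets jsoup drop closing tags that have no matching opening tag.
html = Jsoup.parse(html).html();
System.out.println(html);
Output:
Fetching http://iqtestsites.adtech.de/pictelatest/custombkgd/StylelistDevil.html...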
<!DOCTYPE html>
<html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class="SAF" id="global-header-light">
<head>
<style>
</style>
</head>
<body>
<div style="background-image: url(http://iqtestsites.adtech.de/pictelatest/custombkgd/StylelistDevil.jpg); background-repeat: no-repeat;-webkit-background-size: 1001px 1903px;height: 1903px; width: 1001px; text-align: center; margin: 0 auto;">
<div style="height:1050px; width:300px; padding-left:681px; padding-top:200px;">
<iframe style="height:1050px; width:300px;"></iframe>
</div>
</div>
</body>
</html>
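To check that the re-parse really removed the extra tags, a rough sanity check is to compare the number of closing and opening tags per element. The sketch below is not part of jsoup: TagBalance and surplusClosingTags are hypothetical names, the regular expressions ignore edge cases such as ">" inside attribute values, comments, and void elements, and the sample strings are shortened stand-ins for the output shown in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagBalance {
    // Returns (number of closing tags) - (number of opening tags) for one element name.
    static int surplusClosingTags(String html, String tag) {
        String name = Pattern.quote(tag);
        // Opening tag: the name must be followed by whitespace plus attributes, or by '>'.
        Pattern open = Pattern.compile("<" + name + "(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);
        Pattern close = Pattern.compile("</" + name + "\\s*>", Pattern.CASE_INSENSITIVE);
        return count(close, html) - count(open, html);
    }

    // Counts non-overlapping matches of the pattern in the text.
    static int count(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Shortened stand-ins for the HTML before and after the second parse.
        String before = "<html><head></head><body><div><div></div></div></body></html>"
                + "</div></div></body></html>";
        String after = "<html><head></head><body><div><div></div></div></body></html>";

        System.out.println(surplusClosingTags(before, "div"));  // prints 2
        System.out.println(surplusClosingTags(before, "body")); // prints 1
        System.out.println(surplusClosingTags(after, "div"));   // prints 0
    }
}
```

A surplus of 0 for div, body and html suggests the stray closing tags are gone; it is a heuristic only and does not replace the second jsoup parse.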