Is there a way to have jsoup clean a string containing HTML by escaping the unwanted tags instead of removing them entirely? My example:
String dirty = "This is <b>REALLY</b> dirty code from <a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
String clean = Jsoup.clean(dirty, new Whitelist().addTags("a").addAttributes("a", "href", "name", "rel", "target"));
This produces the "clean" string:
This is REALLY dirty code from <a href="www.rubbish.url.zzzz">haxors-r-us</a>
What I want instead is this "clean" string, with the unwanted tag escaped rather than stripped:
This is &lt;b&gt;REALLY&lt;/b&gt; dirty code from <a href="www.rubbish.url.zzzz">haxors-r-us</a>
Answer 0 (score: 3)
Assuming you are parsing a string rather than a full HTML document (as your question suggests), this method will work:
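import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.safety.Whitelist;
import org.jsoup.select.Elements;

public String escapeHtml(String source) {
    Document doc = Jsoup.parseBodyFragment(source);
    Elements elements = doc.select("b");
    for (Element element : elements) {
        // Swap each <b> element for a text node holding its markup; jsoup escapes text nodes on output.
        element.replaceWith(new TextNode(element.toString(), ""));
    }
    return Jsoup.clean(doc.body().toString(), new Whitelist().addTags("a").addAttributes("a", "href", "name", "rel", "target"));
}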
Instead of hard-coding "b", you can pass in the list of tags you want escaped as a parameter.
The passing JUnit test for escapeHtml:
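A minimal sketch of that parameterized variant follows. The class name `TagEscaper` and the method `escapeTags` are hypothetical names chosen for illustration, and the sketch assumes a recent jsoup release in which `Safelist` replaces the `Whitelist` class and `TextNode` takes a single string argument; on an older jsoup, substitute the calls used in the method above.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.safety.Safelist;

class TagEscaper {
    // Hypothetical helper: escapes the listed tags, keeps <a>, strips everything else.
    static String escapeTags(String source, List<String> tagsToEscape) {
        Document doc = Jsoup.parseBodyFragment(source);
        if (!tagsToEscape.isEmpty()) {
            // Build a selector group such as "b, i" from the tag names.
            String selector = String.join(", ", tagsToEscape);
            for (Element element : doc.select(selector)) {
                // Replace the element with a text node holding its markup;
                // jsoup escapes text nodes when it serializes them.
                element.replaceWith(new TextNode(element.outerHtml()));
            }
        }
        // Assumption: same allowed tag and attributes as in the question.
        Safelist allowed = new Safelist()
                .addTags("a")
                .addAttributes("a", "href", "name", "rel", "target");
        return Jsoup.clean(doc.body().html(), allowed);
    }

    public static void main(String[] args) {
        String dirty = "This is <b>REALLY</b> dirty code from "
                + "<a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
        System.out.println(escapeTags(dirty, List.of("b")));
    }
}
```

Tags that are neither in the list nor allowed by the safelist are still removed by `Jsoup.clean`, so only the tags you name are preserved in escaped form.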
@Test
public void testHtmlEscaping() throws Exception {
    String source = "This is <b>REALLY</b> dirty code from <a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
    String expected = "This is &lt;b&gt;REALLY&lt;/b&gt; dirty code from \n<a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
    String transformed = transformer.escapeHtml(source);
    // JUnit's assertEquals takes the expected value first, then the actual value.
    assertEquals(expected, transformed);
}
Note that I added a line break ("\n") before the "a" tag in the test's expected string, because jsoup pretty-prints its output.
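If the inserted line break is unwanted, one option is to switch off pretty-printing through the `Jsoup.clean` overload that accepts output settings. The sketch below is an illustration, not part of the original answer: `CompactClean` and `cleanCompact` are hypothetical names, and it again assumes a recent jsoup release where `Safelist` replaces `Whitelist`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

class CompactClean {
    // Hypothetical helper: cleans the fragment without jsoup's pretty-printing.
    static String cleanCompact(String html) {
        // Assumption: same allowed tag and attributes as in the question.
        Safelist allowed = new Safelist()
                .addTags("a")
                .addAttributes("a", "href", "name", "rel", "target");
        // prettyPrint(false) stops jsoup from adding line breaks and indentation.
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        // The empty string is the base URI; none is needed for this fragment.
        return Jsoup.clean(html, "", allowed, settings);
    }

    public static void main(String[] args) {
        String dirty = "This is <b>REALLY</b> dirty code from "
                + "<a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
        System.out.println(cleanCompact(dirty));
    }
}
```

With pretty-printing off, a test can compare against an expected string that contains no added "\n".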