Question

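I am using Jsoup 1.9.2 to process some XML input and strip specific tags from it. While doing so, I noticed that Jsoup behaves strangely when asked to clean title tags. Specifically, other XML tags nested inside the title tag are not removed; they are replaced by their escaped forms instead.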

I wrote a short unit test for this, shown below. The test fails because the value of output is CuCl&lt;sub&gt;2&lt;/sub&gt;.

@Test
public void stripXmlSubInTitle() {
    final String input = "<title>CuCl<sub>2</sub></title>";
    final String output = Jsoup.clean(input, Whitelist.none());
    assertEquals("CuCl2", output);
}

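If the title tag is replaced with another tag (for example p or div), everything works as expected. Any explanation and workaround would be appreciated.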

Answer 1

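In HTML5, the title tag is meant to be used inside the head (or html) tag. Because it holds the title of the HTML document, shown mainly in the browser window or tab, it is not supposed to have child tags.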

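Jsoup therefore treats it differently from real content tags such as p or div: the content of title is kept as plain text rather than parsed into child elements. The same applies to textarea.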

Edit:

You could do something like this:

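import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;
import org.jsoup.select.Elements;

public static void main(String[] args) {
    try {
        final String input = "<content><title>CuCl<sub>2</sub></title><othertag>blabla</othertag><title>title with no subtags</title></content>";
        // The HTML parser keeps the markup inside <title> as plain text.
        Document document = Jsoup.parse(input);
        Elements titles = document.getElementsByTag("title");
        for (Element element : titles) {
            // Clean that text as an HTML fragment and store the stripped result.
            element.text(Jsoup.clean(element.ownText(), Whitelist.none()));
        }
        System.out.println(document.body().toString());
    } catch (Exception e) {
        e.printStackTrace();
    }
}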

That prints:

<body>
 <content>
  <title>CuCl2</title>
  <othertag>
   blabla
  </othertag>
  <title>title with no subtags</title>
 </content>
</body>

Depending on your needs, some adjustments may be necessary, for example

System.out.println(Jsoup.clean(document.body().toString(), Whitelist.none()));

That prints:

CuCl2  blabla  title with no subtags
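
Since the input is described as XML, another option is to sidestep HTML parsing rules altogether. The following is only a minimal sketch of that idea, not a Jsoup feature: it uses the JDK's built-in XML parser, assumes the input is well-formed XML with a single root element, and the class and method names (XmlTextExtractor, stripTags) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlTextExtractor {
    // Hypothetical helper (not part of Jsoup): parses well-formed XML and
    // returns the concatenated text of the root element, dropping every tag.
    static String stripTags(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Reject DOCTYPE declarations, a common precaution for untrusted XML.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        // getTextContent() joins the text of all descendant text nodes.
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(stripTags("<title>CuCl<sub>2</sub></title>")); // prints CuCl2
    }
}
```

Here title is just an ordinary XML element, so sub is parsed as a child node and its text is kept. Unlike Jsoup, this parser throws an exception on malformed input, so it only fits if the XML is known to be well-formed.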

Jsoup clean fails on title tags
