Question

With Jsoup it is easy to count how many times a particular tag occurs in a text. For example, I am trying to count the anchor tags in a given text.

    String content = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(content);
    Elements links = doc.select("a[href]"); // a with href
    System.out.println(links.size());

This gives me a count of 4. If I have a sentence and want to know whether it contains any HTML tags, is that possible with Jsoup? Thanks.

Answer 1

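You would probably be better off using a regular expression, but if you really want to use JSoup, you can try matching all elements and then subtracting 4, because JSoup automatically adds four elements: first the root element, then the <html>, <head> and <body> elements.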

This might look like:

// attempt to count html elements in string - incorrect code, see below 
public static int countHtmlElements(String content) {
    Document doc = Jsoup.parse(content);
    Elements elements = doc.select("*");
    return elements.size()-4;
}

This produces wrong results if the text itself contains <html>, <head> or <body>; compare the results:

// gives a correct count of 2 html elements
System.out.println(countHtmlElements("some <b>text</b> with <i>markup</i>"));
// incorrectly counts 0 elements, as the body is subtracted 
System.out.println(countHtmlElements("<body>this gives a wrong result</body>"));

To make this work, you would have to check for the "magic" tags separately; this is why I feel a regular expression might be simpler.
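For illustration, the regular-expression route could be sketched roughly as follows. This is only a minimal sketch using java.util.regex from the standard library, not JSoup; the class name HtmlTagCheck, both method names and the patterns themselves are made up for this example. It assumes a "tag" is anything shaped like <name ...>, so it will also report things like "x<y>z" as a tag, and it will miscount tags whose attribute values contain < or >.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JSoup: a naive regex-based tag check.
class HtmlTagCheck {
    // Opening or self-closing tag, e.g. <b>, <a href='...'>, <br/>
    private static final Pattern OPENING_TAG =
            Pattern.compile("<[A-Za-z][A-Za-z0-9]*(?:\\s[^<>]*)?/?>");

    // Any tag, opening or closing, e.g. <b> or </b>
    private static final Pattern ANY_TAG =
            Pattern.compile("</?[A-Za-z][A-Za-z0-9]*(?:\\s[^<>]*)?/?>");

    // True if the text contains at least one thing that looks like a tag.
    static boolean containsHtmlTag(String text) {
        return ANY_TAG.matcher(text).find();
    }

    // Counts opening (or self-closing) tags, so <b>...</b> counts once.
    static int countOpeningTags(String text) {
        Matcher matcher = OPENING_TAG.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // prints 2
        System.out.println(countOpeningTags("some <b>text</b> with <i>markup</i>"));
        // prints 1: nothing is subtracted, so <body> is counted like any other tag
        System.out.println(countOpeningTags("<body>this gives a wrong result</body>"));
        // prints false: "<" followed by a space is not treated as a tag
        System.out.println(containsHtmlTag("a < b and c > d"));
    }
}
```

Because nothing is parsed or normalised, no elements are added implicitly, so the <body> case from above is counted like any other tag. The price is that this only looks at the shape of the text and does not understand HTML.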

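Some more failed attempts: using parseBodyFragment instead of parse does not help, because JSoup sanitizes the input in the same way. Likewise, counting with doc.select("body *"); saves you the trouble of subtracting 4, but it still yields the wrong count if a <body> is involved. Only if your application can be sure that no <html>, <head> or <body> elements are present in the strings being checked might it work under that restriction.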

Presence of HTML tags using Jsoup