Jsoup - Parsing selected elements

Date: 2017-09-07 16:02:04

Tags: java html dom jsoup

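I need to parse the HTML content below with the Jsoup parser. The requirement is to strip some tags and obtain the output shown below. I cannot get the desired output with the code that follows.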

Input:

<html>

<head>
  <style type="text/css">
    body {
      font: 12px Arial, Helvetica, sans-serif
    }
    
    tr {
      font: 12px Arial, Helvetica, sans-serif;
      padding: 0px 0px 0px 10px
    }
  </style>
</head>

<body>

  <p>hello,<br>&nbsp;<br>We need to dispatch the below documents to you. Thanks for your cooperation.<br><br>Best Regards</p><br>
  <img id="logo_GMALE.png" alt="logo GMALE" src="https://www.GMALE.ch/logo.png">

  <br><b>Test abc xyz</b><br><br>T +91 98 471 <br>

  <a href="mailto:output.test@GMALE.in">output.test@GMALE.in</a><br><br><b>Département Team</b><br><br><b>GMALE Assurances</b><br>StreetName 2<br>Postbox 2100<br>Country<br><br>GMALE.ch<br><br>This is a private email contents.<br><br>This e-mail transmission
  is intended for the named addressee(s) only. Its contents are private, confidential and protected from disclosure and should not be read, copied or disclosed by any other person. If you are not the intended recipient, we kindly ask you to notify the
  sender immediately and to delete this e-mail.<br><br>


</body>
</html>

Output:

<p>hello,<br>&nbsp;<br>We need to dispatch the below documents to you. Thanks for your cooperation.<br><br>Best Regards</p><br>

<br><b>Test abc xyz</b><br><br>T +91 98 471 <br>

The code I have so far:

Document doc = Jsoup.parse(content);
List<Node> childNodes = doc.select("body").get(0).childNodes();
System.out.println("Elements : " + childNodes);
StringBuilder finalContent = new StringBuilder();
for (Node node : childNodes) {
    if (node instanceof Element) {
        Element subElement = (Element) node;
        if (!subElement.tagName().equals("img")) {
            finalContent.append(subElement);
        }
    } else {
        TextNode textNode = (TextNode) node;
        if(!textNode.getWholeText().startsWith("<a")) {
            finalContent.append(textNode);
        }
    }
}

1 answer:

Answer 0 (score: 0)

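Your problem can be restated as follows: parse the body of the HTML above and extract all content until <a href="mailto:output.test@GMALE.in"> is reached. Seen from that angle, you can try the following approach: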

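import java.util.Iterator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

final Document doc = Jsoup.parse(content);
final Elements elements = doc.select("body > *:not(img)"); // direct children of <body>, except <img>
final Iterator<Element> iterator = elements.iterator();
final StringBuilder finalContent = new StringBuilder();

Element current;
// Stop at the end of the list or at the first <a> element (equals, so that tags such as <address> do not match)
while (iterator.hasNext() && !(current = iterator.next()).tagName().equals("a")) {
    finalContent.append(current.toString());
    // Text that is not wrapped in a tag (the phone number) is a sibling text node; nextSibling() may be null
    final Node sibling = current.nextSibling();
    if (sibling instanceof TextNode) {
        final String siblingText = ((TextNode) sibling).text().trim();
        if (!siblingText.isEmpty()) {
            finalContent.append(siblingText);
        }
    }
}

System.out.println(finalContent);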

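First, we use the selector body > *:not(img) to select every direct child element of the body except <img>. Then we iterate over the elements until we reach either the end of the list or the first a element. We also check whether the current element is followed by a sibling text node with any content. This is the case for the phone number, because it is not wrapped in any HTML tag and is a sibling of one of the <br> tags.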
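The reason the sibling check is needed can be seen in a small experiment. The sketch below is only an illustration: the class name SelectorDemo, its two helper methods, and the shortened sample HTML are assumptions made for this example and are not part of the original question. It shows that the selector matches elements only, so bare text has to be picked up through nextSibling().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper class, written only to illustrate the selector behaviour.
class SelectorDemo {

    // Returns the comma-separated tag names matched by "body > *:not(img)".
    static String matchedTags(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder tags = new StringBuilder();
        for (Element e : doc.select("body > *:not(img)")) {
            if (tags.length() > 0) {
                tags.append(',');
            }
            tags.append(e.tagName());
        }
        return tags.toString();
    }

    // Returns the trimmed text directly following the first <br> in the body,
    // or an empty string when the next sibling is not a text node.
    static String textAfterFirstBr(String html) {
        Document doc = Jsoup.parse(html);
        Element br = doc.select("body > br").first();
        if (br == null) {
            return "";
        }
        Node sibling = br.nextSibling();
        return (sibling instanceof TextNode) ? ((TextNode) sibling).text().trim() : "";
    }

    public static void main(String[] args) {
        // Shortened sample input (an assumption, not the original e-mail).
        String html = "<p>hi</p><img src=\"x.png\"><br>T +91 98 471 <br>"
                + "<a href=\"mailto:a@b.c\">a@b.c</a>";
        System.out.println(matchedTags(html));      // element tag names only, no text nodes
        System.out.println(textAfterFirstBr(html)); // the untagged phone number
    }
}
```

The text between the two <br> tags never appears among the selected elements, which is why the loop in the answer inspects the next sibling of each element.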

Running this example produces the following output:

<p>hello,<br>&nbsp;<br>We need to dispatch the below documents to you. Thanks for your cooperation.<br><br>Best Regards</p><br><br><b>Test abc xyz</b><br><br>T +91 98 471<br>

Of course, you may define a different rule for stopping the iteration; this example was created only to give you a hint. I hope it helps.
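As one possible variation on the stop rule, here is a minimal sketch that walks all child nodes of the body (elements and text nodes alike), skips <img>, and stops at the first link whose href starts with "mailto:". The class name StopAtMailto, the method extractUntilMailto, and the choice of "mailto:" as the stop condition are assumptions for illustration, not something the question prescribes. Pretty-printing is switched off so that the serialized output does not depend on Jsoup's indentation rules.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

// Hypothetical helper class sketching an alternative stop rule.
class StopAtMailto {

    // Serializes the body's child nodes up to (not including) the first mailto link.
    static String extractUntilMailto(String html) {
        Document doc = Jsoup.parse(html);
        // Keep the original whitespace instead of re-indenting the markup.
        doc.outputSettings().prettyPrint(false);

        StringBuilder out = new StringBuilder();
        for (Node node : doc.body().childNodes()) {
            if (node instanceof Element) {
                Element el = (Element) node;
                // Assumed stop rule: the first <a> that points to a mailto: address.
                if (el.tagName().equals("a") && el.attr("href").startsWith("mailto:")) {
                    break;
                }
                // Images are dropped from the result.
                if (el.tagName().equals("img")) {
                    continue;
                }
            }
            // Text nodes and all remaining elements are copied as they are.
            out.append(node.outerHtml());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Shortened sample input (an assumption, not the original e-mail).
        String html = "<p>hi</p><img src=\"x.png\"><br>T +91 98 471 <br>"
                + "<a href=\"mailto:a@b.c\">a@b.c</a><br>footer";
        System.out.println(extractUntilMailto(html));
    }
}
```

Because this version iterates over childNodes() rather than over selected elements, untagged text such as the phone number is copied without a separate sibling lookup.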