Removing text nodes and checking for text nodes between tags in HTML: Jsoup

Asked: 2015-05-27 13:34:56

Tags: java javascript html parsing jsoup

I am trying to parse an HTML string with jsoup:

<div class="test">
  <br>From: <b class="sendername">Divya</b> 
  <span dir="ltr">&lt;<a href="mailto:divya@abc.net" target="_blank">divya@abc.net</a>&gt;</span>
  <br>Date: Wed, May 27, 2015 at 11:10 AM
  <br>Subject: Plan for the day 27/05/2015
  <br>To: Abhishek&lt;<a href="mailto:abhishek.sharma@abc.com" target="_blank">abhishek.sharma@abc.<wbr>com</a>&gt;, 
    <a href="mailto:xyz@abc.com" target="_blank">xyz@abc.com</a>&gt;
  <br>Cc: Ram &lt;<a href="mailto:Ram@abc.net" target="_blank">Ram@abc.net</a>&gt;
  <br>
  <br>
  <br>
  <div dir="ltr">Hi,</div>
 </div>
  

Document doc = Jsoup.parse( mailBody.getBodyHtml().get( 0 ) );
Elements elem = doc.getElementsByClass( "test" );
int totalElements = 0;
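// The method name is missing in the source; children() is assumed here (it returns element children only, no text nodes).
Elements childElements = elem.get( 0 ).children();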
int brCount = 0;
for( Element childElement : childElements )
{
    totalElements++;
    if( childElement.tagName().equalsIgnoreCase( "br" ) )
    {
        brCount++;
        if( brCount == 3 )
            break;
    }
    else
        brCount = 0;
}
// Indices 0 .. totalElements - 1 cover every element up to and including the third <br>.
for( int i = 0; i < totalElements; i++ )
{
    childElements.get( i ).remove();
}

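I want to remove everything before three consecutive br tags that have no text nodes between them. In the example above, that would remove all the nodes (HTML tags and text nodes) and produce the following output: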

<div class="test">
  <div dir="ltr">Hi,</div>
 </div>

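  1. How can I check whether there is a text node between two br tags?
  2. The code above removes only the HTML tags; the text nodes are not removed. How can I remove them as well?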

1 Answer:

Answer 0: (score: 0)

The structure of the HTML appears to be fixed, so you can try the following CSS selector:

div.test br + br + br + div
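
One caveat worth keeping in mind: the `+` combinator looks only at element siblings, so this selector does not by itself verify that no text sits between the `<br>` tags. The sketch below illustrates that behaviour, assuming jsoup is on the classpath. The class name `SiblingSelectorDemo`, the helper `countMatches`, and the shortened HTML are made up for illustration and are not part of the original answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SiblingSelectorDemo {

    // Counts how many elements the answer's selector matches in the given HTML.
    static int countMatches(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("div.test br + br + br + div").size();
    }

    public static void main(String[] args) {
        // Text ("one", "two") sits between the <br> tags here.
        String withText = "<div class=\"test\"><br>one<br>two<br><div dir=\"ltr\">Hi,</div></div>";
        // Only whitespace sits between the <br> tags here.
        String withoutText = "<div class=\"test\"><br> <br> <br><div dir=\"ltr\">Hi,</div></div>";

        // Both inputs match, because text nodes are ignored by the sibling combinator.
        System.out.println("with text:    " + countMatches(withText));
        System.out.println("without text: " + countMatches(withoutText));
    }
}
```

So the selector is a good fit when the layout is known to be fixed, but a separate node-level check is needed if intervening text must be ruled out.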

Demo

http://try.jsoup.org/~DiBi9Q_Ye88gi6Hq29Z44ar6xus

Sample code

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<div class=\"test\">\n  <br>From: <b class=\"sendername\">Divya</b> \n  <span dir=\"ltr\">&lt;<a href=\"mailto:divya@abc.net\" target=\"_blank\">divya@abc.net</a>&gt;</span>\n  <br>Date: Wed, May 27, 2015 at 11:10 AM\n  <br>Subject: Plan for the day 27/05/2015\n  <br>To: Abhishek&lt;<a href=\"mailto:abhishek.sharma@abc.com\" target=\"_blank\">abhishek.sharma@abc.<wbr>com</a>&gt;, \n    <a href=\"mailto:xyz@abc.com\" target=\"_blank\">xyz@abc.com</a>&gt;\n  <br>Cc: Ram &lt;<a href=\"mailto:Ram@abc.net\" target=\"_blank\">Ram@abc.net</a>&gt;\n  <br>\n  <br>\n  <br>\n  <div dir=\"ltr\">Hi,</div>\n </div>";

Document doc = Jsoup.parse(html);

Element mailBody = doc.select("div.test br + br + br + div").first();
if (mailBody == null) {
    throw new RuntimeException("Unable to locate mail body.");
}
System.out.println("** BEFORE:\n" + doc);

Document tmp = Jsoup.parseBodyFragment("<div class='test'>" + mailBody.outerHtml() + "</div>");
mailBody.parent().replaceWith(tmp.select("div.test").first());
System.out.println("\n** AFTER:\n" + doc);

Output

** BEFORE:
<html>
 <head></head>
 <body>
  <div class="test"> 
   <br>From: 
   <b class="sendername">Divya</b> 
   <span dir="ltr">&lt;<a href="mailto:divya@abc.net" target="_blank">divya@abc.net</a>&gt;</span> 
   <br>Date: Wed, May 27, 2015 at 11:10 AM 
   <br>Subject: Plan for the day 27/05/2015 
   <br>To: Abhishek&lt;
   <a href="mailto:abhishek.sharma@abc.com" target="_blank">abhishek.sharma@abc.<wbr>com</a>&gt;, 
   <a href="mailto:xyz@abc.com" target="_blank">xyz@abc.com</a>&gt; 
   <br>Cc: Ram &lt;
   <a href="mailto:Ram@abc.net" target="_blank">Ram@abc.net</a>&gt; 
   <br> 
   <br> 
   <br> 
   <div dir="ltr">
    Hi,
   </div> 
  </div>
 </body>
</html>

** AFTER:
<html>
 <head></head>
 <body>
  <div class="test">
   <div dir="ltr">
     Hi, 
   </div>
  </div>
 </body>
</html>
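
The two questions in the post (detecting text between `<br>` tags, and removing text nodes as well as elements) can also be handled at the node level. The following is a minimal sketch of one possible approach, not the answerer's method. It assumes jsoup is on the classpath and relies on `childNodes()`, which includes text nodes, unlike `children()`. The names `HeaderStripper` and `stripHeader` are hypothetical, and treating whitespace-only text as "no text" is an assumption about the intended behaviour.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class HeaderStripper {

    /**
     * Removes every child node of the container up to and including the first run
     * of three br tags that are separated only by blank text.
     * Returns true if such a run was found, false if nothing was changed.
     */
    static boolean stripHeader(Element container) {
        // Work on a copy so that removing nodes does not disturb the iteration.
        List<Node> nodes = new ArrayList<>(container.childNodes());
        int brCount = 0;
        int cutIndex = -1;

        for (int i = 0; i < nodes.size(); i++) {
            Node node = nodes.get(i);
            if (node instanceof Element && "br".equals(((Element) node).tagName())) {
                brCount++;
                if (brCount == 3) {
                    cutIndex = i;
                    break;
                }
            } else if (node instanceof TextNode && ((TextNode) node).isBlank()) {
                // Whitespace-only text does not interrupt a run of br tags.
            } else {
                // Any other element or non-blank text resets the run.
                brCount = 0;
            }
        }

        if (cutIndex < 0) {
            return false;
        }
        // Node.remove() works for elements and text nodes alike.
        for (int i = 0; i <= cutIndex; i++) {
            nodes.get(i).remove();
        }
        return true;
    }

    public static void main(String[] args) {
        String html = "<div class=\"test\"><br>From: <b>Divya</b><br>Subject: Plan"
                + "<br> <br> <br><div dir=\"ltr\">Hi,</div></div>";
        Document doc = Jsoup.parse(html);
        Element container = doc.select("div.test").first();

        boolean stripped = stripHeader(container);
        System.out.println("stripped: " + stripped);
        System.out.println(container.outerHtml());
    }
}
```

Because the edit happens in place, there is no need to re-parse a fragment and swap the parent element. If no qualifying run of `<br>` tags exists, the container is left untouched.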