I am trying to parse an HTML string using jsoup:
<div class="test">
<br>From: <b class="sendername">Divya</b>
<span dir="ltr"><<a href="mailto:divya@abc.net" target="_blank">divya@abc.net</a>></span>
<br>Date: Wed, May 27, 2015 at 11:10 AM
<br>Subject: Plan for the day 27/05/2015
<br>To: Abhishek<<a href="mailto:abhishek.sharma@abc.com" target="_blank">abhishek.sharma@abc.<wbr>com</a>>,
<a href="mailto:xyz@abc.com" target="_blank">xyz@abc.com</a>>
<br>Cc: Ram <<a href="mailto:Ram@abc.net" target="_blank">Ram@abc.net</a>>
<br>
<br>
<br>
<div dir="ltr">Hi,</div>
</div>
Document doc = Jsoup.parse( mailBody.getBodyHtml().get( 0 ) );
Elements elem = doc.getElementsByClass( "test" );
int totalElements = 0;
Elements childElements = elem.get( 0 ).children();
int brCount = 0;
for( Element childElement : childElements )
{
totalElements++;
if( childElement.tagName().equalsIgnoreCase( "br" ) )
{
brCount++;
if( brCount == 3 )
break;
}
else
brCount = 0;
}
for( int i = 1; i <= totalElements; i++ )
{
childElements.get( i ).remove();
}
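I want to remove everything before three consecutive br tags that have no text node between them. In the example above, that means removing all the preceding nodes (HTML tags and text nodes) so that the output looks like this: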
<div class="test">
<div dir="ltr">Hi,</div>
</div>
Answer 0 (score: 0)
The structure of the HTML seems to be constant, so you can try the following CSS selector:
div.test br + br + br + div
http://try.jsoup.org/~DiBi9Q_Ye88gi6Hq29Z44ar6xus
String html = "<div class=\"test\">\n <br>From: <b class=\"sendername\">Divya</b> \n <span dir=\"ltr\"><<a href=\"mailto:divya@abc.net\" target=\"_blank\">divya@abc.net</a>></span>\n <br>Date: Wed, May 27, 2015 at 11:10 AM\n <br>Subject: Plan for the day 27/05/2015\n <br>To: Abhishek<<a href=\"mailto:abhishek.sharma@abc.com\" target=\"_blank\">abhishek.sharma@abc.<wbr>com</a>>, \n <a href=\"mailto:xyz@abc.com\" target=\"_blank\">xyz@abc.com</a>>\n <br>Cc: Ram <<a href=\"mailto:Ram@abc.net\" target=\"_blank\">Ram@abc.net</a>>\n <br>\n <br>\n <br>\n <div dir=\"ltr\">Hi,</div>\n </div>";
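// Parse the sample HTML into a jsoup Document.
Document doc = Jsoup.parse(html);
// Select the div that directly follows three consecutive <br> element siblings.
Element mailBody = doc.select("div.test br + br + br + div").first();
if (mailBody == null) {
throw new RuntimeException("Unable to locate mail body.");
}
System.out.println("** BEFORE:\n" + doc);
// Build a fresh div.test that holds only the mail body, then swap it in for the original div.
Document tmp = Jsoup.parseBodyFragment("<div class='test'>" + mailBody.outerHtml() + "</div>");
mailBody.parent().replaceWith(tmp.select("div.test").first());
System.out.println("\n** AFTER:\n" + doc);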
** BEFORE:
<html>
<head></head>
<body>
<div class="test">
<br>From:
<b class="sendername">Divya</b>
<span dir="ltr"><<a href="mailto:divya@abc.net" target="_blank">divya@abc.net</a>></span>
<br>Date: Wed, May 27, 2015 at 11:10 AM
<br>Subject: Plan for the day 27/05/2015
<br>To: Abhishek<
<a href="mailto:abhishek.sharma@abc.com" target="_blank">abhishek.sharma@abc.<wbr>com</a>>,
<a href="mailto:xyz@abc.com" target="_blank">xyz@abc.com</a>>
<br>Cc: Ram <
<a href="mailto:Ram@abc.net" target="_blank">Ram@abc.net</a>>
<br>
<br>
<br>
<div dir="ltr">
Hi,
</div>
</div>
</body>
</html>
** AFTER:
<html>
<head></head>
<body>
<div class="test">
<div dir="ltr">
Hi,
</div>
</div>
</body>
</html>
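One caveat about the selector approach: the + combinator only looks at element siblings, so it does not itself check that there is no text between the br tags. It works here because the final div happens to follow the run of empty br tags. If the structure turns out not to be constant, a rough alternative is to walk the child nodes (elements and text nodes alike) and cut at the first run of three br tags separated only by whitespace. The sketch below is just an illustration under that assumption; the class name StripMailHeader, the helper stripBeforeTripleBr, and the shortened sample HTML are made up for this example and are not part of jsoup or of the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class StripMailHeader {

    /**
     * Hypothetical helper: removes every child node of the container up to and
     * including the first run of three br elements that are separated only by
     * whitespace. Returns true if such a run was found.
     */
    static boolean stripBeforeTripleBr(Element container) {
        // Work on a copy: childNodes() is a read-only view that changes as nodes are removed.
        List<Node> nodes = new ArrayList<>(container.childNodes());
        int brRun = 0;   // number of br elements seen in the current run
        int cut = -1;    // index of the third br of the run, once found

        for (int i = 0; i < nodes.size(); i++) {
            Node node = nodes.get(i);
            if (node instanceof Element && "br".equals(((Element) node).tagName())) {
                brRun++;
                if (brRun == 3) {
                    cut = i;
                    break;
                }
            } else if (node instanceof TextNode && ((TextNode) node).isBlank()) {
                // Whitespace between tags does not break the run.
            } else {
                // Any other element or non-blank text starts the count over.
                brRun = 0;
            }
        }

        if (cut < 0) {
            return false;
        }
        for (int i = 0; i <= cut; i++) {
            nodes.get(i).remove();
        }
        return true;
    }

    public static void main(String[] args) {
        // Shortened version of the mail HTML from the question (an assumption for brevity).
        String html = "<div class=\"test\"><br>From: <b>Divya</b><br>Date: Wed<br>Cc: Ram"
                + "<br>\n<br>\n<br>\n<div dir=\"ltr\">Hi,</div></div>";
        Document doc = Jsoup.parse(html);
        Element container = doc.select("div.test").first();
        if (container != null && stripBeforeTripleBr(container)) {
            System.out.println(container.outerHtml());
        }
    }
}
```

Because the original div.test is modified in place, its attributes are kept and no replacement fragment has to be built. When no qualifying run exists, the helper returns false and leaves the document untouched.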