Given this piece of HTML:
<table width="99%">
<tr>
<td valign="top">
<a href="popup_info.cfm?story=3703" target="popup2" onclick="var hwin=window.open('', 'popup2', 'resizable=1,scrollbars=yes,status=no,width=620,height=450');"><strong>48-Hour Notice</strong></a>
<br />
<strong>News of Districtwide Interest</strong>
<br />A 48-Hour Notice that the Bridgewater-Raritan Regional Board of Education’s Special Meeting – Policy on Wednesday, May 18, 2016 originally scheduled for 8:00 p.m. at the Harmon V. Wade Administration Building has been rescheduled to begin at 7:00
p.m. Action may be taken.
<br clear="all">
<p></p>
<br />
<a href="popup_info.cfm?story=3578" target="popup2" onclick="var hwin=window.open('', 'popup2', 'resizable=1,scrollbars=yes,status=no,width=620,height=450');"><strong>Modified 2015-2016 School Calendar</strong></a>
<br />Adamsville Primary, Bradley Gardens Primary, Crim Primary, Hamilton Primary, John F. Kennedy Primary, Milltown Primary, Van Holten Primary, Eisenhower Intermediate, Hillside Intermediate, Middle School, High School, Home Page Only
<br />At their meeting on Tuesday, May 10, 2016, the Board of Education approved the modification of the 2015-2016 School Calendar to include Monday, June 13, 2016 as a day off for all students and staff. Please refer to the modified school calendar link
below on our district website:modified school calendar
<br clear="all">
<p></p>
<br />
<a href="popup_info.cfm?story=3689" target="popup2" onclick="var hwin=window.open('', 'popup2', 'resizable=1,scrollbars=yes,status=no,width=620,height=450');"><strong>Teacher of the Year and Educational Services Professional Award Winners</strong></a>
<br/>
<strong>News of Districtwide Interest</strong>
<br />Congratulations to our staff members who have been named to the 2015-2016 Bridgewater-Raritan Teacher of the Year Award and the 2015-2016 Educational Services Professional Award. These individuals were honored at the district’s Staff Reception,
sponsored by the BREA, on Wednesday, May 4, at the High School. On behalf of the Board of Education, we thank them for their outstanding...
<a href="popup_info.cfm?story=3689" target="popup2" onclick="var hwin=window.open('', 'popup2', 'resizable=1,scrollbars=yes,status=no,width=620,height=450');">
more info</a>
<br clear="all">
<p></p>
<br />
How can I parse the text outside the <strong> tags into separate elements?
Elements news = doc.select("p:not[^]");
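This only gives me one giant element containing everything, including the content inside the strong elements.
Ideally, I would like the code to work as follows: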
Element 1: A 48-Hour Notice that the Bridgewater-Raritan Regional Board of Education’s Special Meeting – Policy on Wednesday, May 18, 2016 originally scheduled for 8:00 p.m. at the Harmon V. Wade Administration Building has been rescheduled to begin at 7:00 p.m. Action may be taken.
Element 2: Adamsville Primary, Bradley Gardens Primary, Crim Primary, Hamilton Primary, John F. Kennedy Primary, Milltown Primary, Van Holten Primary, Eisenhower Intermediate, Hillside Intermediate, Middle School, High School, Home Page Only<br />
At their meeting on Tuesday, May 10, 2016, the Board of Education approved the modification of the 2015-2016 School Calendar to include Monday, June 13, 2016 as a day off for all students and staff. Please refer to the modified school calendar link below on our district website:modified school calendar
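And so on...
Answer 0 (score: 2)
How to parse HTML text that is not attached to any element
As of Jsoup 1.9.2, this is not possible with the Selector class, because selectors match elements, not bare text nodes.
So your next option is to use the Jsoup API directly. In particular, you would work with the TextNode class. This option requires too much work.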
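For reference, the TextNode route could look roughly like the sketch below. It assumes that the loose text sits directly inside the <td> and that every story starts with an <a> wrapping a <strong>, as in the snippet from the question. The class LooseTextExtractor and its methods are hypothetical names made up for illustration; they are not part of Jsoup.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper class, not part of Jsoup.
public class LooseTextExtractor {

    // Parses the HTML and groups the loose text of the first <td> by story.
    static List<String> extractLooseText(String html) {
        Document doc = Jsoup.parse(html);
        Element cell = doc.select("td").first();
        List<String> groups = new ArrayList<>();
        if (cell == null) {
            return groups;
        }
        StringBuilder current = new StringBuilder();
        for (Node child : cell.childNodes()) {
            if (isHeadlineLink(child)) {
                // Assumption: a headline link marks the start of a new story.
                flush(current, groups);
            } else if (child instanceof TextNode) {
                // Only text nodes that are direct children of the cell are kept,
                // so text inside <strong> or <a> is skipped.
                String text = ((TextNode) child).text().trim();
                if (!text.isEmpty()) {
                    if (current.length() > 0) {
                        current.append(' ');
                    }
                    current.append(text);
                }
            }
        }
        flush(current, groups);
        return groups;
    }

    // True for an <a> element that contains a <strong> (the story headline).
    private static boolean isHeadlineLink(Node node) {
        if (!(node instanceof Element)) {
            return false;
        }
        Element el = (Element) node;
        return el.tagName().equals("a") && !el.getElementsByTag("strong").isEmpty();
    }

    // Stores the collected text as one group and resets the buffer.
    private static void flush(StringBuilder current, List<String> groups) {
        if (current.length() > 0) {
            groups.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>"
                + "<a href='a.cfm'><strong>Title 1</strong></a><br />"
                + "<strong>Category</strong><br />Body one.<br clear='all'><p></p><br />"
                + "<a href='b.cfm'><strong>Title 2</strong></a><br />Line A<br />Line B"
                + "</td></tr></table>";
        for (String group : extractLooseText(html)) {
            System.out.println(group);
        }
    }
}
```

Each returned string corresponds to one "Element N" from the question. If the real page nests the text differently, the grouping rule in isHeadlineLink would need to be adjusted.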
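So the last option is to use the site's RSS feed: http://www.brrsd.k12.nj.us/rss/News.xml. The information is well formatted and easier to parse. See the sample code below for details.
How can I find the XML pages for the other sites?
You can find more RSS feeds here: http://www.brrsd.k12.nj.us/newinfo.cfm. Once on the page, click the "RSS Feeds" tab.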
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class FetchRssFeed {

    public static void main(String[] args) throws IOException {
        String rssFeedUrl = "http://www.brrsd.k12.nj.us/rss/News.xml";
        Document doc = Jsoup.connect(rssFeedUrl).parser(Parser.xmlParser()).get();

        Elements items = doc.select("item");
        for (Element item : items) {
            String title = extractData(item, "title", "<NO TITLE>");
            String description = extractData(item, "description", "<NO DESCRIPTION>");

            if (description.endsWith("... (Continued)")) {
                // Fetch full description
                String newsUrl = extractData(item, "guid", null);
                description += " [UNABLE TO GET FULL DESCRIPTION]";

                if (newsUrl != null) {
                    Document news = Jsoup.connect(newsUrl).get();
                    Element newsContent = news.select("#content > table > tbody > tr > td").first();
                    if (newsContent != null) {
                        Elements tmp = newsContent.select("span.sw-newsHeader");
                        title = tmp.text();
                        tmp.remove(); // Remove title to get full description
                        description = newsContent.text();
                    }
                }
            }

            System.out.format("Title: %s%nDescription: %s%n%n", title, description);
        }
    }

    private static String extractData(Element item, String dataName, String defaultValue) {
        Element data = item.select(dataName).first();
        String dataValue;
        if (data == null) {
            dataValue = defaultValue;
        } else {
            dataValue = data.text();
        }
        return dataValue;
    }
}
Title: Daily Announcements 5-19-16
Description: 8th grade choir will practice TB47th gr band rehearses TB78th gr band rehearses TB5The school store will be open today during lunch, please stop by.
Title: 6th Grade UPENN Museum Trip, Thursday, May 19, 2016
Description: Students should arrive in the All Purpose Room between 6:45 and 7:00 am. Students should not bring school materials to school with them that day.(...)
(...)