Jsoup: How to parse HTML text that is not associated with any element

Date: 2016-05-18 12:32:26

Tags: jsoup

Given this piece of HTML:

<table width="99%">
  <tr>
    <td valign="top">
      <a href="popup_info.cfm?story=3703" target="popup2" onclick="var hwin=window.open('', 'popup2', 'resizable=1,scrollbars=yes,status=no,width=620,height=450');"><strong>48-Hour Notice</strong></a>
      <br />
      <strong>News of Districtwide Interest</strong>
      <br />A 48-Hour Notice that the Bridgewater-Raritan Regional Board of Education’s Special Meeting – Policy on Wednesday, May 18, 2016 originally scheduled for 8:00 p.m. at the Harmon V. Wade Administration Building has been rescheduled to begin at 7:00
      p.m. &nbsp; Action may be taken. &nbsp;
      <br clear="all">
      <p></p>
      <br />
      <a href="popup_info.cfm?story=3578" target="popup2" onclick="var hwin=window.open('', 'popup2', 'resizable=1,scrollbars=yes,status=no,width=620,height=450');"><strong>Modified 2015-2016 School Calendar</strong></a>
      <br />Adamsville Primary, Bradley Gardens Primary, Crim Primary, Hamilton Primary, John F. Kennedy Primary, Milltown Primary, Van Holten Primary, Eisenhower Intermediate, Hillside Intermediate, Middle School, High School, Home Page Only
      <br />At their meeting on Tuesday, May 10, 2016, the Board of Education approved the modification of the 2015-2016 School Calendar to include Monday, June 13, 2016 as a day off for all students and staff. Please refer to the modified school calendar link
      below on our district website:modified school calendar&nbsp;
      <br clear="all">
      <p></p>
      <br />
      <a href="popup_info.cfm?story=3689" target="popup2" onclick="var hwin=window.open('', 'popup2', 'resizable=1,scrollbars=yes,status=no,width=620,height=450');"><strong>Teacher of the Year and Educational Services Professional Award Winners</strong></a>
      <br/>
      <strong>News of Districtwide Interest</strong>
      <br />Congratulations to our staff members who have been named to the 2015-2016 Bridgewater-Raritan Teacher of the Year Award and the 2015-2016 Educational Services Professional Award. &nbsp;These individuals were honored at the district’s Staff Reception,
      sponsored by the BREA, on Wednesday, May 4, at the High School. &nbsp;On behalf of the Board of Education, we thank them for their outstanding...
      <a href="popup_info.cfm?story=3689" target="popup2" onclick="var hwin=window.open('', 'popup2', 'resizable=1,scrollbars=yes,status=no,width=620,height=450');">
    more info</a> 
      <br clear="all">
      <p></p>
      <br />

How can I parse the text outside the strong tags into separate elements?

Elements news = doc.select("p:not[^]"); 

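only gives me one giant element containing everything, including the content inside the strong elements.

Ideally, I would like the code to produce something like this: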

Element 1: A 48-Hour Notice that the Bridgewater-Raritan Regional Board of Education’s Special Meeting – Policy on Wednesday, May 18, 2016 originally scheduled for 8:00 p.m. at the Harmon V. Wade Administration Building has been rescheduled to begin at 7:00 p.m. &nbsp; Action may be taken. &nbsp;

Element 2: Adamsville Primary, Bradley Gardens Primary, Crim Primary, Hamilton Primary, John F. Kennedy Primary, Milltown Primary, Van Holten Primary, Eisenhower Intermediate, Hillside Intermediate, Middle School, High School, Home Page Only<br />
At their meeting on Tuesday, May 10, 2016, the Board of Education approved the modification of the 2015-2016 School Calendar to include Monday, June 13, 2016 as a day off for all students and staff. Please refer to the modified school calendar link below on our district website:modified school calendar&nbsp;

And so on...

1 Answer:

Answer 0 (score: 2)

  

How to parse HTML text that is not associated with any element

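As of Jsoup 1.9.2, this cannot be done with the Selector class alone: CSS selectors match elements, and this text is not wrapped in an element of its own. Your next option is to work with the Jsoup API directly, in particular the TextNode class. That option takes considerably more work.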
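For reference, here is a rough sketch of what the TextNode route could look like. It is not taken from the original page. It assumes that every story starts with an `<a>` wrapping a `<strong>`, and that the loose text sits directly inside the `<td>`. The class name `LooseTextExtractor`, the method `extractLooseText`, and the inline HTML are made up for illustration; the HTML is a simplified stand-in for the markup in the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class LooseTextExtractor {

    // Groups the text nodes that are direct children of the container.
    // Assumption: a new story starts at each <a> that wraps a <strong>.
    static List<String> extractLooseText(Element container) {
        List<String> groups = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        for (Node child : container.childNodes()) {
            if (child instanceof TextNode) {
                // Treat &nbsp; as a plain space and collapse whitespace runs
                String text = ((TextNode) child).text()
                        .replace('\u00a0', ' ')
                        .replaceAll("\\s+", " ")
                        .trim();
                if (!text.isEmpty()) {
                    if (current.length() > 0) {
                        current.append(' ');
                    }
                    current.append(text);
                }
            } else if (child instanceof Element) {
                Element el = (Element) child;
                boolean isHeadline = el.tagName().equals("a")
                        && !el.select("strong").isEmpty();
                // A headline link closes the group collected so far
                if (isHeadline && current.length() > 0) {
                    groups.add(current.toString());
                    current.setLength(0);
                }
                // Everything else (<br>, <p>, category <strong>,
                // "more info" link) is skipped
            }
        }

        if (current.length() > 0) {
            groups.add(current.toString());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Simplified, hypothetical stand-in for the table in the question
        String html = "<table><tr><td>"
                + "<a href=\"#1\"><strong>First headline</strong></a><br>"
                + "<strong>Category</strong><br>"
                + "First body &nbsp; continues.<br><p></p><br>"
                + "<a href=\"#2\"><strong>Second headline</strong></a><br>"
                + "Line one<br>Line two <a href=\"#2\">more info</a>"
                + "</td></tr></table>";

        Element cell = Jsoup.parse(html).select("td").first();
        List<String> groups = extractLooseText(cell);

        for (int i = 0; i < groups.size(); i++) {
            System.out.format("Element %d: %s%n", i + 1, groups.get(i));
        }
    }
}
```

With the sample input above this should print two groups, "First body continues." and "Line one Line two". On the real page the grouping rule would likely need tuning, which is part of why this route is more work than it first looks.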

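So the last option is to use the site's RSS feed: http://www.brrsd.k12.nj.us/rss/News.xml. The information there is well structured and much easier to parse. See the sample code below for details.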

  

How do I find the XML pages for the other sites?

You can find more RSS feeds here: http://www.brrsd.k12.nj.us/newinfo.cfm. Once on the page, click the "RSS Feeds" tab.

Sample code

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class FetchRssFeed {

    public static void main(String[] args) throws IOException {
        String rssFeedUrl = "http://www.brrsd.k12.nj.us/rss/News.xml";
        Document doc = Jsoup.connect(rssFeedUrl).parser(Parser.xmlParser()).get();

        Elements items = doc.select("item");

        for (Element item : items) {
            String title = extractData(item, "title", "<NO TITLE>");
            String description = extractData(item, "description", "<NO DESCRIPTION>");

            // Descriptions ending in "... (Continued)" are cut short in the feed,
            // so fetch the full text from the story page
            if (description.endsWith("... (Continued)")) {
                String newsUrl = extractData(item, "guid", null);
                // Fallback marker, overwritten below if the full text is found
                description += " [UNABLE TO GET FULL DESCRIPTION]";

                if (newsUrl != null) {
                    Document news = Jsoup.connect(newsUrl).get();
                    Element newsContent = news.select("#content > table > tbody > tr > td").first();

                    if (newsContent != null) {
                        Elements header = newsContent.select("span.sw-newsHeader");
                        // Only override the feed title when the page actually has a header
                        if (!header.isEmpty()) {
                            title = header.text();
                        }
                        header.remove(); // Remove title to get full description

                        description = newsContent.text();
                    }
                }
            }

            System.out.format("Title: %s%nDescription: %s%n%n", title, description);
        }
    }

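    // Returns the text of the first child element matching dataName,
    // or defaultValue when the item has no such element
    private static String extractData(Element item, String dataName, String defaultValue) {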
        Element data = item.select(dataName).first();
        String dataValue;

        if (data == null) {
            dataValue = defaultValue;
        } else {
            dataValue = data.text();
        }

        return dataValue;
    }
}

OUTPUT (truncated)

Title: Daily Announcements 5-19-16
Description: 8th grade choir will practice TB47th gr band rehearses TB78th gr band rehearses TB5The school store will be open today during lunch, please stop by.

Title: 6th Grade UPENN Museum Trip, Thursday, May 19, 2016
Description: Students should arrive in the All Purpose Room between 6:45 and 7:00 am. Students should not bring school materials to school with them that day.(...)
(...)