How do I extract individual parts of HTML with Jsoup?

Asked: 2014-02-02 13:14:40

Tags: java android html jsoup

I have used a few Jsoup methods to get a String containing part of a web page's HTML:

protected String doInBackground(String... arguments) {
    // extract arguments
    String newsurl = arguments[0];

    Document doc = null;
    try {
        doc = Jsoup.connect(newsurl).get();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (NullPointerException e) {
        e.printStackTrace();
    }
    if (doc != null) {
        Elements myElements = doc.getElementsByClass("news_list");

        string1 = myElements.toString();
        Log.i("ELEMENTS HTML", string1);
    } else {
        string1 = "FAILED";
    }
    return string1;
}

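However, I can't find a way to break the resulting Elements object down further into separate strings. I suspect my approach is wrong.

The part of the HTML I want to work with looks like this: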

<table class="news_list" cellspacing="0" cellpadding="0" border="0" id="ctl00_cphInnerPage_cntrlNewsList_gvNews" style="border-width:0px;width:100%;border-collapse:collapse;">
    <tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl02_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=462"><img src="/mc_newsdata/photos/635254712252165967_thumb.jpg" style="border-width:0px;" /></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl02_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=462">1/16/2014</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl02_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=462">Science Fair</a>
                                </div>
                                <div class="summary">
                                    The annual Science Fair of the American College of Sofia took place on Wednesday, January 15. You could see photos of some of the incredible projects and experiments in our photo gallery.
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl02_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=462">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl02_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl03_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=461"></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl03_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=461">1/10/2014</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl03_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=461">ACS Students’ Results from PISA 2012</a>
                                </div>
                                <div class="summary">
                                    ACS recently received the official results of our students&rsquo; performance at the Programme for International Student Assessment (PISA) 2012. PISA is a triennial international survey developed by the Organisation for Economic Co-operation and Development (OECD) that takes place since 2000. It evaluates education systems worldwide by testing the skills and knowledge of 15-16-year-old students in the key subjects: reading, mathematics and science, with a focus on one subject in each year of assessment. In 2012, the assessment focused on students&rsquo; knowledge in mathematics. 
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl03_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=461">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl03_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl04_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=458"></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl04_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=458">12/20/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl04_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=458">PHOTOS FROM THE CHRISTMAS CONCERT AND THE ALUMNI RECEPTION</a>
                                </div>
                                <div class="summary">
                                    You can see some great photos from the amazing Annual Christmas Concert taken by Konstantin Karchev from 11 Grade, as well as some photos from the Alumni Reception by visiting the photogallery of the website.
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl04_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=458">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl04_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl05_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=457"></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl05_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=457">12/19/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl05_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=457">THREE ACS MEDAL-WINNERS MEET PRESIDENT PLEVNELIEV</a>
                                </div>
                                <div class="summary">
                                    On December 16, the Third Olympic Meeting of the members of the national student science teams with Bulgarian President Rosen Plevneliev took place. Three ACS students, well-known in the ACS community for their successes in science, were among the invited: Viktor Kouzmanov 12/4, Konstantin Karchev 11/4, and Mihaela Zaharieva from the Class of 2013. 
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl05_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=457">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl05_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl06_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=456"><img src="/mc_newsdata/photos/635228847467352694_thumb.jpg" style="border-width:0px;" /></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl06_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=456">12/17/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl06_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=456">ACS Debaters with an Award from a National Debate Tournament</a>
                                </div>
                                <div class="summary">
                                    This past weekend ACSers from the Debate Club (with faculty advisors Adam Saligman, Milka Getsovska, and Michael Deegan) took part along with students from 14 other schools from all over the country in the first national Bulgarian Forensic League (&ldquo;BFL&rdquo;) tournament of the year. An ACS team consisting of students Adelina Ivanova (11/7), Veselin Nanov (10/2), and Mihail Georgiev (10/7) won the first prize in the &quot;Karl Popper Debate&quot; varsity category, a specific format involving a team of three debating another team of three, all in the age group of Grades 10 to 12. Congratulations to Adelina, Veselin, Mihail, and their faculty advisors for their great achievement!
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl06_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=456">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl06_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl07_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=455"></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl07_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=455">12/13/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl07_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=455">ACS Senior Victorious at an International Physics Olympiad</a>
                                </div>
                                <div class="summary">
                                    Last Saturday, the Bulgarian Physics Team featuring ACS senior Victor Kouzmanov returned with the special Grand Prix team prize, one silver, and two bronze medals from the International Experimental Physics Olympiad held in Moscow November 27 through December 6. Congratulations and lots of success for the future to Victor and his teammates!
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl07_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=455">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl07_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl08_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=453"></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl08_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=453">12/4/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl08_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=453">ACS Alumnus Won a Prestigious Trading Competition in the US</a>
                                </div>
                                <div class="summary">
                                    Congratulations to Kubrat Danailov of the ACS Class of 2011 on winning the prestigious Intercollegiate Trading Competition held in Boston, USA last month after competing with 100 other students from some of the best universities in the USA - MIT, Harvard, UPenn, Princeton, Yale, Columbia, Cornell, UChicago, Wellesley, Baruch, NYU, and Boston University.
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl08_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=453">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl08_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl09_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=452"><img src="/mc_newsdata/photos/635210613621441367_thumb.jpg" style="border-width:0px;" /></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl09_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=452">11/26/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl09_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=452">ACS OPEN VOLLEYBALL TOURNAMENT RESULTS</a>
                                </div>
                                <div class="summary">
                                    The ACS OPEN Volleyball Tournament 2013 took place between Nov 18 through 24. <br/><br/>Below you can see the final standings for boys and girls, as well as the MVP awards winners:
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl09_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=452">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl09_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl10_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=451"><img src="/mc_newsdata/photos/635209646534734186_thumb.jpg" style="border-width:0px;" /></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl10_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=451">11/22/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl10_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=451">ART EXHIBITION</a>
                                </div>
                                <div class="summary">
                                    The latest Art Exhibition is posted in the Art Gallery in Sanders Hall. It shows works of ACS students drawn in the elective Art classes. <br/>
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl10_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=451">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl10_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl11_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=449"><img src="/mc_newsdata/photos/635204602267379593_thumb.jpg" style="border-width:0px;" /></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl11_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=449">11/19/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl11_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=449">Day of Tolerance at ACS</a>
                                </div>
                                <div class="summary">
                                    Today, November 19, ACS Club Embrace is organizing a series of events to mark the International Day of Tolerance celebrated on November 16 since 1995. After the discussion held during advisory periods and the lunch happening at Ostrander Foyer (see photo) the event will be marked by a screening at 3:30 PM of short movies dedicated to the subject of tolerance. All members of the ACS community are welcome to see the thought-provoking short movies!
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl11_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=449">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl11_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr>

</table>

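I want to extract the title, date, link and content of each news item into arrays/strings, and also get the link to each image.

Thanks for your help!!

Edit: I noticed that each piece of information has its own distinct class name, so in theory I could search by that. But the Elements class does not seem to have a method like getElementsByClass.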

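1 answer:

Answer 0 (score: 3)

You can use getElementsByTag, since you know what the child elements are. In this case you need a handle on all the nested tables that hold the values you want.

So change your Elements to: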

Elements myElements = doc.getElementsByClass("news_list").first().getElementsByTag("table");

Now iterate over them to pick out the individual elements:

for (Element el : myElements) {
    Element title = el.getElementsByClass("home_title").first();
    Element date = el.getElementsByClass("home_date").first();
    Element link = el.getElementsByClass("news_list_image").first();

    System.out.println(title.text());
    System.out.println(date.text());
    System.out.println(link.child(0).attr("href"));
    System.out.println();
}

Output:

Science Fair
1/16/2014
/NewsDetails.aspx?cat_id=1&news_id=462

ACS Students’ Results from PISA 2012
1/10/2014
/NewsDetails.aspx?cat_id=1&news_id=461

PHOTOS FROM THE CHRISTMAS CONCERT AND THE ALUMNI RECEPTION
12/20/2013
/NewsDetails.aspx?cat_id=1&news_id=458
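
The question also asks for the summary text and the image link, which the loop above does not print. Below is a minimal sketch of one way to collect all five values per news item, not a definitive implementation. It uses Jsoup's CSS selectors (select) instead of getElementsByTag, because getElementsByTag also matches the element it is called on, so the outer news_list table would be visited too. The NewsItem class, the parse helper and the shortened SAMPLE_HTML are hypothetical names made up for this illustration; the class names in the selectors come from the HTML in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class NewsListSketch {

    // Hypothetical value holder for one news entry (not part of Jsoup).
    static final class NewsItem {
        final String title, date, link, summary, imageUrl;

        NewsItem(String title, String date, String link, String summary, String imageUrl) {
            this.title = title;
            this.date = date;
            this.link = link;
            this.summary = summary;
            this.imageUrl = imageUrl;
        }
    }

    // Shortened copy of the markup from the question: one entry with an image, one without.
    static final String SAMPLE_HTML =
        "<table class=\"news_list\">"
        + "<tr><td><table><tr><td>"
        + "<div class=\"news_list_image\"><a class=\"wborder\" href=\"/NewsDetails.aspx?cat_id=1&amp;news_id=462\">"
        + "<img src=\"/mc_newsdata/photos/635254712252165967_thumb.jpg\" /></a></div>"
        + "<div><a class=\"home_date sitelink\" href=\"/NewsDetails.aspx?cat_id=1&amp;news_id=462\">1/16/2014</a></div>"
        + "<div><a class=\"home_title sitelink\" href=\"/NewsDetails.aspx?cat_id=1&amp;news_id=462\">Science Fair</a></div>"
        + "<div class=\"summary\">The annual Science Fair of the American College of Sofia took place on Wednesday, January 15.</div>"
        + "</td></tr></table></td></tr>"
        + "<tr><td><table><tr><td>"
        + "<div class=\"news_list_image\"><a class=\"wborder\" href=\"/NewsDetails.aspx?cat_id=1&amp;news_id=457\"></a></div>"
        + "<div><a class=\"home_date sitelink\" href=\"/NewsDetails.aspx?cat_id=1&amp;news_id=457\">12/19/2013</a></div>"
        + "<div><a class=\"home_title sitelink\" href=\"/NewsDetails.aspx?cat_id=1&amp;news_id=457\">THREE ACS MEDAL-WINNERS MEET PRESIDENT PLEVNELIEV</a></div>"
        + "<div class=\"summary\">On December 16, the Third Olympic Meeting of the members of the national student science teams took place.</div>"
        + "</td></tr></table></td></tr>"
        + "</table>";

    static List<NewsItem> parse(String html) {
        Document doc = Jsoup.parse(html);
        List<NewsItem> items = new ArrayList<>();
        // The descendant selector matches only tables nested inside the outer
        // news_list table, so each match is exactly one news entry.
        for (Element row : doc.select("table.news_list table")) {
            String title = row.select("a.home_title").text();
            String date = row.select("a.home_date").text();
            String link = row.select("a.home_title").attr("href");
            String summary = row.select("div.summary").text();
            // Some entries have no <img>; attr() then returns an empty string.
            String imageUrl = row.select("div.news_list_image img").attr("src");
            items.add(new NewsItem(title, date, link, summary, imageUrl));
        }
        return items;
    }

    public static void main(String[] args) {
        for (NewsItem item : parse(SAMPLE_HTML)) {
            System.out.println(item.title);
            System.out.println(item.date);
            System.out.println(item.link);
            System.out.println(item.summary);
            System.out.println(item.imageUrl.isEmpty() ? "(no image)" : item.imageUrl);
            System.out.println();
        }
    }
}
```

In the Android code from the question, the same loop could run on the Document returned by Jsoup.connect(newsurl).get() inside doInBackground. The method would then return the list of items rather than one large string.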