Retrieve the first innermost table within a table using jsoup CSS selectors

Date: 2013-04-25 05:36:41

Tags: html css-selectors jsoup

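I was working with HTML tables when I came across a page that contains tables nested inside tables. I have extracted the first table on the page as follows: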

final Document document = Jsoup.connect("http://www.webdesign.org/html-and-css/tutorials/table-examples.6139.html").get();
final Elements tables = document.select("table");     
final Element table = tables.get(0);

Now I want to use a jsoup CSS selector to extract the first innermost table from the HTML below:

<table cellspacing="0" cellpadding="0"> 
 <tbody>
  <tr> 
   <td id="header_left"><a href="/">
     <div id="logo"></div></a>
    <!-- end logo --></td> 
   <td id="header_center"> 
    <div id="header_menu"> 
     <h2><a href="http://www.templatemonster.com" target="_blank">WEB DESIGN TEMPLATES</a></h2> 
     <p><a href="http://www.templatemonster.com/website-templates.php/?aff=wdl">HTML &amp; CSS Templates</a></p> 
     <p><a href="http://www.templatemonster.com/wordpress-themes.php/?aff=wdl">Wordpress Themes</a></p>
     <p><a href="http://www.templatemonster.com/prestashop-themes.php/?aff=wdl">PrestaShop Themes</a></p> 
    </div>
    <!-- end header_nemu --> 
    <div id="header_books"></div>
    <!-- end header_books --> </td> 
   <td id="header_right"> 
    <div id="search_pic"></div>
    <!-- end search_pic --> 
    <div id="header_search_div"> 
     <div class="block-search-heading">
      SEARCH
     </div> 
     <form method="get" action="/search.html"> 
      <table> 
       <tbody>
        <tr> 
         <td colspan="2" class="keyword"><input type="text" id="search-keyword" name="keywords" value="" title=" - Any Keyword(s) - " /></td> 
        </tr> 
        <tr> 
         <td class="category"><select id="category" name="category"> <option value="0" style="font-weight:bold;">- All categories -</option> <option value="-1" style="font-weight:bold;">Website Templates</option><option value="1" style="font-weight: bold; ">Web Design Basics</option><option value="26">&nbsp;&nbsp;Web Design Showcase</option><option value="2">&nbsp;&nbsp;Design Principles</option><option value="108">&nbsp;&nbsp;Typography</option><option value="111">&nbsp;&nbsp;Responsive Design</option><option value="99" style="font-weight: bold; ">CMS</option><option value="102">&nbsp;&nbsp;Drupal</option><option value="103">&nbsp;&nbsp;Joomla</option><option value="100">&nbsp;&nbsp;Wordpress</option><option value="109" style="font-weight: bold; ">Tutorials</option><option value="7">&nbsp;&nbsp;Photoshop</option><option value="97">&nbsp;&nbsp;&nbsp;&nbsp;Editor's Pick</option><option value="60">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop Basics</option><option value="61">&nbsp;&nbsp;&nbsp;&nbsp;Special Effects</option><option value="62">&nbsp;&nbsp;&nbsp;&nbsp;Text Effects</option><option value="63">&nbsp;&nbsp;&nbsp;&nbsp;3D Effects</option><option value="64">&nbsp;&nbsp;&nbsp;&nbsp;Textures &amp; Patterns</option><option value="65">&nbsp;&nbsp;&nbsp;&nbsp;Web Layout</option><option value="66">&nbsp;&nbsp;&nbsp;&nbsp;Drawing Techniques</option><option value="67">&nbsp;&nbsp;&nbsp;&nbsp;Color Management</option><option value="68">&nbsp;&nbsp;&nbsp;&nbsp;Photo Editing</option><option value="69">&nbsp;&nbsp;&nbsp;&nbsp;ImageReady Animation</option><option value="72">&nbsp;&nbsp;&nbsp;&nbsp;Miscellaneous</option><option value="81">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop CS4 Tutorials</option><option value="98">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop CS5 Tutorials</option><option value="105">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop CS6 Tutorials</option><option value="53">&nbsp;&nbsp;Vector Graphics</option><option value="21">&nbsp;&nbsp;HTML and CSS</option><option value="30" style="font-weight: 
bold; ">Miscellaneous</option><option value="50">&nbsp;&nbsp;Interviews</option><option value="104">&nbsp;&nbsp;Inspiration</option><option value="110">&nbsp;&nbsp;Freebies</option></select></td> 
         <td class="submit"><input type="submit" value="" /></td> 
        </tr> 
       </tbody>
      </table> 
     </form>
    </div>
    <!-- end header_search_div --></td> 
  </tr> 
 </tbody>
</table>

I want to get the table inside this table, that is, the first innermost table:

<table> 
       <tbody>
        <tr> 
         <td colspan="2" class="keyword"><input type="text" id="search-keyword" name="keywords" value="" title=" - Any Keyword(s) - " /></td> 
        </tr> 
        <tr> 
         <td class="category"><select id="category" name="category"> <option value="0" style="font-weight:bold;">- All categories -</option> <option value="-1" style="font-weight:bold;">Website Templates</option><option value="1" style="font-weight: bold; ">Web Design Basics</option><option value="26">&nbsp;&nbsp;Web Design Showcase</option><option value="2">&nbsp;&nbsp;Design Principles</option><option value="108">&nbsp;&nbsp;Typography</option><option value="111">&nbsp;&nbsp;Responsive Design</option><option value="99" style="font-weight: bold; ">CMS</option><option value="102">&nbsp;&nbsp;Drupal</option><option value="103">&nbsp;&nbsp;Joomla</option><option value="100">&nbsp;&nbsp;Wordpress</option><option value="109" style="font-weight: bold; ">Tutorials</option><option value="7">&nbsp;&nbsp;Photoshop</option><option value="97">&nbsp;&nbsp;&nbsp;&nbsp;Editor's Pick</option><option value="60">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop Basics</option><option value="61">&nbsp;&nbsp;&nbsp;&nbsp;Special Effects</option><option value="62">&nbsp;&nbsp;&nbsp;&nbsp;Text Effects</option><option value="63">&nbsp;&nbsp;&nbsp;&nbsp;3D Effects</option><option value="64">&nbsp;&nbsp;&nbsp;&nbsp;Textures &amp; Patterns</option><option value="65">&nbsp;&nbsp;&nbsp;&nbsp;Web Layout</option><option value="66">&nbsp;&nbsp;&nbsp;&nbsp;Drawing Techniques</option><option value="67">&nbsp;&nbsp;&nbsp;&nbsp;Color Management</option><option value="68">&nbsp;&nbsp;&nbsp;&nbsp;Photo Editing</option><option value="69">&nbsp;&nbsp;&nbsp;&nbsp;ImageReady Animation</option><option value="72">&nbsp;&nbsp;&nbsp;&nbsp;Miscellaneous</option><option value="81">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop CS4 Tutorials</option><option value="98">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop CS5 Tutorials</option><option value="105">&nbsp;&nbsp;&nbsp;&nbsp;Photoshop CS6 Tutorials</option><option value="53">&nbsp;&nbsp;Vector Graphics</option><option value="21">&nbsp;&nbsp;HTML and CSS</option><option value="30" style="font-weight: 
bold; ">Miscellaneous</option><option value="50">&nbsp;&nbsp;Interviews</option><option value="104">&nbsp;&nbsp;Inspiration</option><option value="110">&nbsp;&nbsp;Freebies</option></select></td> 
         <td class="submit"><input type="submit" value="" /></td> 
        </tr> 
       </tbody>
      </table> 

I would really like to know how to do this. Any pointers would be very helpful.

2 Answers:

Answer 0 (score: 3)

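As far as I know, you cannot select the most inner element using CSS and jsoup selector syntax alone. If the first element does not exist, this element cannot be selected.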

The jsoup selector syntax is documented here: http://jsoup.org/cookbook/extracting-data/selector-syntax

jsoup selectors are mostly CSS-like, and jsoup adds its own set of special pseudo-classes (its documentation calls them pseudo selectors).
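As an illustration of those pseudo selectors, here is a minimal sketch that is not part of the original answer. It assumes jsoup is on the classpath and runs against a made-up HTML string; the class name `PseudoSelectorSketch`, the helper `idsFor` and the ids `outer`, `inner` and `plain` are hypothetical. It combines `:has(selector)` and `:not(selector)` to match tables that contain no nested table, which may offer a selector-only route to innermost tables.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class PseudoSelectorSketch {

    // Made-up markup: a table with one nested table, followed by a table with none.
    static final String HTML =
            "<table id=\"outer\"><tr><td>"
          + "<table id=\"inner\"><tr><td>x</td></tr></table>"
          + "</td></tr></table>"
          + "<table id=\"plain\"><tr><td>y</td></tr></table>";

    // Runs the selector against the sample markup and returns the ids
    // of the matched elements in document order.
    static List<String> idsFor(String cssQuery) {
        Elements matches = Jsoup.parse(HTML).select(cssQuery);
        List<String> ids = new ArrayList<>();
        for (Element el : matches) {
            ids.add(el.id());
        }
        return ids;
    }

    public static void main(String[] args) {
        // Tables that contain at least one nested table
        System.out.println(idsFor("table:has(table)"));        // [outer]
        // Tables that contain no nested table
        System.out.println(idsFor("table:not(:has(table))"));  // [inner, plain]
    }
}
```

Note that `table:not(:has(table))` returns every table without a nested table, so on a real page it would still need `.first()` or a more specific ancestor to narrow the result down.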

To find tables with the CSS class "block-search":

Elements elements = doc.select("table.block-search");

To find tables with the CSS class "block-search" that are definitely inside <table cellspacing="0" cellpadding="0" id="header_tab">:

Elements elements = doc.select("table#header_tab table.block-search");

To find the first nested table with the "block-search" class inside <table cellspacing="0" cellpadding="0" id="header_tab">:

Element element = doc.select("table#header_tab table.block-search").first();

UPD

Hope this works for you. Note the line inside the last while loop: current = current.children().select("table").first();

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AppJsoap {

    public static void main(String... args) throws IOException {

        Document document = Jsoup
                .connect(
                        "http://www.webdesign.org/html-and-css/tutorials/table-examples.6139.html")
                .get();
        // Select every table that is nested inside another table (document order)
        Elements tables = document.select("table table");

        System.out.println(tables.size());
        for (Element el : tables) {
            System.out.println(path(el));
        }

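        {
            System.out.println("------");
            // Start at the first nested table and keep descending into the
            // first table found below the current one until none is left
            Element found = null;
            Element current = tables.get(0);
            while (current != null) {
                System.out.println("current = " + path(current));
                found = current;
                // children().select(...) searches below current, so current itself never matches
                current = current.children().select("table").first();
            }
            System.out.println("found = " + path(found));
        }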
    }

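    // Builds a readable path from the document root down to el;
    // the number in brackets is each node's sibling index
    public static String path(Element el) {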
        String path = el.parent() != null ? path(el.parent()) : "";
        path += el.nodeName() + "[" + el.siblingIndex() + "] ";
        return path;
    }
}

Output

31
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1] tbody[1] tr[0] td[5] div[4] form[3] table[1] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[3] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[7] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[11] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[15] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[19] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[23] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[27] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[31] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[35] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[39] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[43] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[47] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[51] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[55] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[59] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[63] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[67] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[71] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[75] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[79] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[83] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[87] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[14] table[1] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[22] table[1] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[5] div[1] div[1] div[3] form[1] table[1] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[7] div[2] div[2] div[2] div[3] table[1] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[4] td[3] table[25] 
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[4] td[3] table[29] 
------
current = #document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1] 
current = #document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1] tbody[1] tr[0] td[5] div[4] form[3] table[1] 
found = #document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1] tbody[1] tr[0] td[5] div[4] form[3] table[1] 

Answer 1 (score: 0)

After some trial and error, I finally found the answer. Here is the code:

Document document = Jsoup.connect("http://www.webdesign.org/html-and-css/tutorials/table-examples.6139.html").get();
Elements tables = document.select("table");     
Element table = tables.get(0);

// Checks if a table contains table inside it
while(! table.select(":has(table)").isEmpty()){
    table = table.select("table table").first();
}

It retrieves the first innermost table within the table.
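The same descent can be wrapped in a small helper. This is a sketch under assumptions, not the answerer's code: the class `InnermostTableSketch`, the method `firstInnermostTable` and the sample markup are made up for illustration, and jsoup is assumed to be on the classpath. It returns null for a page without tables (where `tables.get(0)` would throw), and it searches with `children().select("table")` so that the current table is never a candidate for its own nested table. For the sample markup it also cross-checks the result against the selector `table:not(:has(table))`, which matches tables that contain no nested table.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class InnermostTableSketch {

    // Made-up markup: three levels of nesting plus a sibling table at the second level.
    static final String HTML =
            "<table id=\"t1\"><tr><td>"
          + "<table id=\"t2\"><tr><td>"
          + "<table id=\"t3\"><tr><td>deep</td></tr></table>"
          + "</td></tr></table>"
          + "<table id=\"t4\"><tr><td>sibling</td></tr></table>"
          + "</td></tr></table>";

    // Follows the first nested table at each level, starting from the first
    // table under root. Returns null when root contains no table at all.
    static Element firstInnermostTable(Element root) {
        Element current = root.select("table").first(); // null when nothing matches
        while (current != null) {
            // Search only below current, so current itself is never returned
            Element nested = current.children().select("table").first();
            if (nested == null) {
                return current; // nothing nested: end of the first branch
            }
            current = nested;
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Element viaLoop = firstInnermostTable(doc);
        Element viaSelector = doc.select("table:not(:has(table))").first();
        System.out.println(viaLoop.id());     // t3 for this sample
        System.out.println(viaSelector.id()); // t3 for this sample
    }
}
```

The loop and the selector agree here because the first table without a nested table, in document order, is the one reached by always taking the first nested table. The loop is the easier variant to adapt when the starting table is not the first one in the document.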