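While working with HTML tables, I came across a page that contains tables nested inside other tables. I extracted the first table on the page as follows: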
final Document document = Jsoup.connect("http://www.webdesign.org/html-and-css/tutorials/table-examples.6139.html").get();
final Elements tables = document.select("table");
final Element table = tables.get(0);
Now I want to use Jsoup CSS selectors to extract the first innermost table from the HTML below:
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td id="header_left"><a href="/">
<div id="logo"></div></a>
<!-- end logo --></td>
<td id="header_center">
<div id="header_menu">
<h2><a href="http://www.templatemonster.com" target="_blank">WEB DESIGN TEMPLATES</a></h2>
<p><a href="http://www.templatemonster.com/website-templates.php/?aff=wdl">HTML & CSS Templates</a></p>
<p><a href="http://www.templatemonster.com/wordpress-themes.php/?aff=wdl">Wordpress Themes</a></p>
<p><a href="http://www.templatemonster.com/prestashop-themes.php/?aff=wdl">PrestaShop Themes</a></p>
</div>
<!-- end header_nemu -->
<div id="header_books"></div>
<!-- end header_books --> </td>
<td id="header_right">
<div id="search_pic"></div>
<!-- end search_pic -->
<div id="header_search_div">
<div class="block-search-heading">
SEARCH
</div>
<form method="get" action="/search.html">
<table>
<tbody>
<tr>
<td colspan="2" class="keyword"><input type="text" id="search-keyword" name="keywords" value="" title=" - Any Keyword(s) - " /></td>
</tr>
<tr>
<td class="category"><select id="category" name="category"> <option value="0" style="font-weight:bold;">- All categories -</option> <option value="-1" style="font-weight:bold;">Website Templates</option><option value="1" style="font-weight: bold; ">Web Design Basics</option><option value="26"> Web Design Showcase</option><option value="2"> Design Principles</option><option value="108"> Typography</option><option value="111"> Responsive Design</option><option value="99" style="font-weight: bold; ">CMS</option><option value="102"> Drupal</option><option value="103"> Joomla</option><option value="100"> Wordpress</option><option value="109" style="font-weight: bold; ">Tutorials</option><option value="7"> Photoshop</option><option value="97"> Editor's Pick</option><option value="60"> Photoshop Basics</option><option value="61"> Special Effects</option><option value="62"> Text Effects</option><option value="63"> 3D Effects</option><option value="64"> Textures & Patterns</option><option value="65"> Web Layout</option><option value="66"> Drawing Techniques</option><option value="67"> Color Management</option><option value="68"> Photo Editing</option><option value="69"> ImageReady Animation</option><option value="72"> Miscellaneous</option><option value="81"> Photoshop CS4 Tutorials</option><option value="98"> Photoshop CS5 Tutorials</option><option value="105"> Photoshop CS6 Tutorials</option><option value="53"> Vector Graphics</option><option value="21"> HTML and CSS</option><option value="30" style="font-weight: bold; ">Miscellaneous</option><option value="50"> Interviews</option><option value="104"> Inspiration</option><option value="110"> Freebies</option></select></td>
<td class="submit"><input type="submit" value="" /></td>
</tr>
</tbody>
</table>
</form>
</div>
<!-- end header_search_div --></td>
</tr>
</tbody>
</table>
I want to get the table nested inside this table, that is, the first innermost table:
<table>
<tbody>
<tr>
<td colspan="2" class="keyword"><input type="text" id="search-keyword" name="keywords" value="" title=" - Any Keyword(s) - " /></td>
</tr>
<tr>
<td class="category"><select id="category" name="category"> <option value="0" style="font-weight:bold;">- All categories -</option> <option value="-1" style="font-weight:bold;">Website Templates</option><option value="1" style="font-weight: bold; ">Web Design Basics</option><option value="26"> Web Design Showcase</option><option value="2"> Design Principles</option><option value="108"> Typography</option><option value="111"> Responsive Design</option><option value="99" style="font-weight: bold; ">CMS</option><option value="102"> Drupal</option><option value="103"> Joomla</option><option value="100"> Wordpress</option><option value="109" style="font-weight: bold; ">Tutorials</option><option value="7"> Photoshop</option><option value="97"> Editor's Pick</option><option value="60"> Photoshop Basics</option><option value="61"> Special Effects</option><option value="62"> Text Effects</option><option value="63"> 3D Effects</option><option value="64"> Textures & Patterns</option><option value="65"> Web Layout</option><option value="66"> Drawing Techniques</option><option value="67"> Color Management</option><option value="68"> Photo Editing</option><option value="69"> ImageReady Animation</option><option value="72"> Miscellaneous</option><option value="81"> Photoshop CS4 Tutorials</option><option value="98"> Photoshop CS5 Tutorials</option><option value="105"> Photoshop CS6 Tutorials</option><option value="53"> Vector Graphics</option><option value="21"> HTML and CSS</option><option value="30" style="font-weight: bold; ">Miscellaneous</option><option value="50"> Interviews</option><option value="104"> Inspiration</option><option value="110"> Freebies</option></select></td>
<td class="submit"><input type="submit" value="" /></td>
</tr>
</tbody>
</table>
I would really like to know how to do this. Any pointers would be very helpful.
Answer 0 (score: 3)
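As far as I know, you can't select the innermost element with CSS or the jsoup selector syntax alone. A nested element also cannot be selected if the outer (first) element does not exist.
The jsoup selector syntax is documented here: http://jsoup.org/cookbook/extracting-data/selector-syntax
jsoup selectors are mostly CSS-like, and jsoup adds its own set of special pseudo-classes (the docs call them "pseudo selectors").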
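For this particular case, combining two of those pseudo selectors may come close to a selector-only solution. The following is a minimal sketch, not part of the original answer: it assumes that jsoup's `:has(...)` matches only on descendants and that `:not(...)` can wrap it, so `table:not(:has(table))` would match tables that contain no nested table. The class name, method name, and inline HTML are hypothetical and exist only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LeafTableSelector {

    // Returns the first table (in document order) that has no table nested
    // inside it, or null when the document contains no table at all.
    public static Element firstLeafTable(Document doc) {
        return doc.select("table:not(:has(table))").first();
    }

    public static void main(String[] args) {
        // Hypothetical markup: three tables nested through <td> cells.
        String html = "<table id='outer'><tr><td>"
                + "<table id='middle'><tr><td>"
                + "<table id='inner'><tr><td>x</td></tr></table>"
                + "</td></tr></table>"
                + "</td></tr></table>";

        Element leaf = firstLeafTable(Jsoup.parse(html));
        System.out.println(leaf == null ? "no table" : leaf.id());
    }
}
```

Because `select` returns matches in document order, the first table without a nested table is the one reached by repeatedly stepping into the first nested table, which is what the question asks for.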
To find tables with the CSS class "block-search":
Elements elements = doc.select("table.block-search");
To find tables with the CSS class "block-search" that are nested inside <table cellspacing="0" cellpadding="0" id="header_tab">:
Elements elements = doc.select("table#header_tab table.block-search");
To find the first nested table with the class "block-search" inside <table cellspacing="0" cellpadding="0" id="header_tab">:
Element element = doc.select("table#header_tab table.block-search").first();
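To make the effect of the scoping visible without fetching the live page, here is a small sketch that runs the same selector against an inline document. The HTML, the element ids `outside`, `first`, and `second`, and the class and method names are assumptions made up for this example. Only the selector strings come from the answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScopedSelectorDemo {

    // First table.block-search nested anywhere inside table#header_tab,
    // or null when there is no such table.
    public static Element firstSearchTable(Document doc) {
        return doc.select("table#header_tab table.block-search").first();
    }

    public static void main(String[] args) {
        // Hypothetical markup: one matching table outside #header_tab,
        // two matching tables inside it.
        String html = "<table class='block-search' id='outside'><tr><td>a</td></tr></table>"
                + "<table id='header_tab'><tr><td>"
                + "<table class='block-search' id='first'><tr><td>b</td></tr></table>"
                + "<table class='block-search' id='second'><tr><td>c</td></tr></table>"
                + "</td></tr></table>";
        Document doc = Jsoup.parse(html);

        // Unscoped: every table with the class, wherever it sits.
        System.out.println(doc.select("table.block-search").size());
        // Scoped: only tables below #header_tab are considered.
        System.out.println(firstSearchTable(doc).id());
    }
}
```

The space in the selector is the descendant combinator, so the inner table may sit at any depth below `#header_tab`, not only as a direct child.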
UPD:
Hope this works for you. Pay attention to the final while loop with current = current.children().select("table").first();
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AppJsoap {

    public static void main(String... args) throws IOException {
        Document document = Jsoup
                .connect("http://www.webdesign.org/html-and-css/tutorials/table-examples.6139.html")
                .get();

        // Every table that is nested inside another table
        Elements tables = document.select("table table");
        System.out.println(tables.size());
        for (Element el : tables) {
            System.out.println(path(el));
        }

        System.out.println("------");
        Element found = null;
        Element current = tables.get(0);
        // Keep stepping into the first nested table until none is left
        while (current != null) {
            System.out.println("current = " + path(current));
            found = current;
            current = current.children().select("table").first();
        }
        System.out.println("found = " + path(found));
    }

    // Builds a readable path from the document root down to el
    public static String path(Element el) {
        String path = el.parent() != null ? path(el.parent()) : "";
        path += el.nodeName() + "[" + el.siblingIndex() + "] ";
        return path;
    }
}
Output:
31
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1] tbody[1] tr[0] td[5] div[4] form[3] table[1]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[3]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[7]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[11]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[15]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[19]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[23]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[27]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[31]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[35]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[39]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[43]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[47]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[51]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[55]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[59]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[63]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[67]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[71]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[75]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[79]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[83]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[13] table[87]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[14] table[1]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[3] div[22] table[1]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[4] div[2] div[1] div[5] div[1] div[1] div[3] form[1] table[1]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[2] td[1] table[1] tbody[1] tr[0] td[7] div[2] div[2] div[2] div[3] table[1]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[4] td[3] table[25]
#document[0] html[1] body[2] div[7] table[1] tbody[1] tr[4] td[3] table[29]
------
current = #document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1]
current = #document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1] tbody[1] tr[0] td[5] div[4] form[3] table[1]
found = #document[0] html[1] body[2] div[7] table[1] tbody[1] tr[0] td[1] table[1] tbody[1] tr[0] td[5] div[4] form[3] table[1]
Answer 1 (score: 0)
After some trial and error, I finally found the answer. Here is the code:
Document document = Jsoup.connect("http://www.webdesign.org/html-and-css/tutorials/table-examples.6139.html").get();
Elements tables = document.select("table");
Element table = tables.get(0);
// Keep descending while the current table still contains a nested table
while (!table.select(":has(table)").isEmpty()) {
    table = table.select("table table").first();
}
It retrieves the first innermost table inside the table.
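One caveat: tables.get(0) throws an IndexOutOfBoundsException when the page has no table at all. The sketch below wraps the same descent in a reusable method that returns null in that case. It steps with `children().select("table").first()` as in the other answer's loop, and it parses an inline string instead of the live URL. The class name, method name, and sample HTML are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class InnermostTable {

    // Starts at the first table of the document and keeps stepping into the
    // first nested table. Returns the last table reached, or null when the
    // document contains no table.
    public static Element firstInnermostTable(Document doc) {
        Element found = null;
        Element current = doc.select("table").first(); // null if no table
        while (current != null) {
            found = current;
            // Search below the current table only, never the table itself.
            current = current.children().select("table").first();
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical markup loosely modelled on the header table above.
        String html = "<table id='header'><tr><td>"
                + "<form><table id='search'><tr><td>keyword</td></tr></table></form>"
                + "</td></tr></table>";

        Element innermost = firstInnermostTable(Jsoup.parse(html));
        System.out.println(innermost == null ? "no table" : innermost.id());
    }
}
```

Returning null keeps the method in line with `Elements.first()`, which also yields null for an empty result, so callers can handle both cases with a single check.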