I'm trying to traverse the HTML tree of a Wikipedia page, but the traversal seems to skip some blocks of HTML elements. Is there a way to prevent this?
Document doc = Jsoup.connect(url).timeout(10000).userAgent(USER_AGENT).get();
// get the first table with the specific class
Element tableWithDetails = doc.select("table[class=infobox geography vcard").get(0);
tableWithDetails.traverse(new NodeVisitor() {
    public void head(Node node, int depth) {
        if (!node.nodeName().equalsIgnoreCase("#text")) {
            p("Entering tag: " + node.nodeName());
        }
    }
    public void tail(Node node, int depth) {
        if (!node.nodeName().equalsIgnoreCase("#text")) {
            p("Exiting tag: " + node.nodeName());
        }
    }
});
<table class="infobox geography vcard" style="width:22em;width:23em">
<tbody>
<tr>
<th colspan="2" style="text-align:center;font-size:125%;font-weight:bold;font-size:1.25em; white-space:nowrap"><span class="fn org"><span class="wrap">Dresden</span></span></th>
</tr>
<tr>
<td colspan="2" style="text-align:center;padding:0.7em 0.8em"><a href="/wiki/File:Dresden_montage.JPG" class="image" title="Clockwise: Dresden at night, Dresden Frauenkirche, Schloss Pillnitz, Dresden Castle and Zwinger."><img alt="Clockwise: Dresden at night, Dresden Frauenkirche, Schloss Pillnitz, Dresden Castle and Zwinger." src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Dresden_montage.JPG/300px-Dresden_montage.JPG" width="300" height="390" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Dresden_montage.JPG/450px-Dresden_montage.JPG 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Dresden_montage.JPG/600px-Dresden_montage.JPG 2x" data-file-width="610" data-file-height="792"></a>
<div>
Clockwise: Dresden at night,
<a href="/wiki/Dresden_Frauenkirche" title="Dresden Frauenkirche">Dresden Frauenkirche</a>,
<a href="/wiki/Schloss_Pillnitz" title="Schloss Pillnitz" class="mw-redirect">Schloss Pillnitz</a>,
<a href="/wiki/Dresden_Castle" title="Dresden Castle">Dresden Castle</a> and
<a href="/wiki/Zwinger_(Dresden)" title="Zwinger (Dresden)">Zwinger</a>.
</div> </td>
</tr>
Entering tag: table
Entering tag: tbody
Entering tag: tr
Entering tag: th
Entering tag: span
Entering tag: span
Exiting tag: span
Exiting tag: span
Exiting tag: th
Exiting tag: tr
Entering tag: tr
Entering tag: td
Entering tag: a
Entering tag: img
Exiting tag: img
Exiting tag: a
Exiting tag: td
Exiting tag: tr
The div just before the closing td and tr is omitted.
Answer 0 (score: 0)
Try this CSS query instead:
table.infobox.geography.vcard
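The two selector forms differ in what they match: an attribute selector such as table[class=infobox geography vcard] compares the whole class attribute value, whereas table.infobox.geography.vcard only requires each of the three class names to be present. The sketch below illustrates that difference with plain string handling. It is not jsoup's implementation, and matchesExact and matchesAllClasses are hypothetical helper names introduced only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustration only: mimics the matching semantics, not jsoup's actual code.
final class ClassSelectorSketch {

    // Like [class=value]: the whole attribute value must equal the given string.
    static boolean matchesExact(String classAttr, String value) {
        return classAttr.trim().equalsIgnoreCase(value);
    }

    // Like .a.b.c: every requested class name must appear, in any order.
    static boolean matchesAllClasses(String classAttr, String... wanted) {
        Set<String> present = new HashSet<>(
                Arrays.asList(classAttr.trim().toLowerCase().split("\\s+")));
        for (String name : wanted) {
            if (!present.contains(name.toLowerCase())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String attr = "infobox vcard geography"; // same classes, different order
        System.out.println(matchesExact(attr, "infobox geography vcard"));            // false
        System.out.println(matchesAllClasses(attr, "infobox", "geography", "vcard")); // true
    }
}
```

The class-based form is therefore the more tolerant of the two if the page ever reorders or adds class names.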
String html = "<table class=\"infobox geography vcard\" style=\"width:22em;width:23em\"> \n"
+ " <tbody>\n"
+ " <tr> \n"
+ " <th colspan=\"2\" style=\"text-align:center;font-size:125%;font-weight:bold;font-size:1.25em; white-space:nowrap\"><span class=\"fn org\"><span class=\"wrap\">Dresden</span></span></th> \n"
+ " </tr> \n"
+ " <tr> \n"
+ " <td colspan=\"2\" style=\"text-align:center;padding:0.7em 0.8em\"><a href=\"/wiki/File:Dresden_montage.JPG\" class=\"image\" title=\"Clockwise: Dresden at night, Dresden Frauenkirche, Schloss Pillnitz, Dresden Castle and Zwinger.\"><img alt=\"Clockwise: Dresden at night, Dresden Frauenkirche, Schloss Pillnitz, Dresden Castle and Zwinger.\" src=\"//upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Dresden_montage.JPG/300px-Dresden_montage.JPG\" width=\"300\" height=\"390\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Dresden_montage.JPG/450px-Dresden_montage.JPG 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Dresden_montage.JPG/600px-Dresden_montage.JPG 2x\" data-file-width=\"610\" data-file-height=\"792\"></a> \n"
+ " <div>\n" + " Clockwise: Dresden at night, \n"
+ " <a href=\"/wiki/Dresden_Frauenkirche\" title=\"Dresden Frauenkirche\">Dresden Frauenkirche</a>, \n"
+ " <a href=\"/wiki/Schloss_Pillnitz\" title=\"Schloss Pillnitz\" class=\"mw-redirect\">Schloss Pillnitz</a>, \n"
+ " <a href=\"/wiki/Dresden_Castle\" title=\"Dresden Castle\">Dresden Castle</a> and \n"
+ " <a href=\"/wiki/Zwinger_(Dresden)\" title=\"Zwinger (Dresden)\">Zwinger</a>.\n" + " </div> </td> \n" + " </tr>";
Document doc = Jsoup.parse(html);
Element tableWithDetails = doc.select("table.infobox.geography.vcard").get(0);
tableWithDetails.traverse(new NodeVisitor() {
    public void head(Node node, int depth) {
        if (!node.nodeName().equalsIgnoreCase("#text")) {
            p("Entering tag: " + node.nodeName());
        }
    }
    public void tail(Node node, int depth) {
        if (!node.nodeName().equalsIgnoreCase("#text")) {
            p("Exiting tag: " + node.nodeName());
        }
    }
});
Entering tag: table
Entering tag: tbody
Entering tag: tr
Entering tag: th
Entering tag: span
Entering tag: span
Exiting tag: span
Exiting tag: span
Exiting tag: th
Exiting tag: tr
Entering tag: tr
Entering tag: td
Entering tag: a
Entering tag: img
Exiting tag: img
Exiting tag: a
Entering tag: div <---
Entering tag: a
Exiting tag: a
Entering tag: a
Exiting tag: a
Entering tag: a
Exiting tag: a
Entering tag: a
Exiting tag: a
Exiting tag: div <---
Exiting tag: td
Exiting tag: tr
Exiting tag: tbody
Exiting tag: table
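The nesting of the Entering and Exiting lines follows from a depth-first walk in which head is called when a node is first reached and tail once all of its children have been visited. As a minimal sketch of that pattern, assuming a simplified tree: SimpleNode, Visitor, and log below are hypothetical stand-ins written for illustration, not jsoup classes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a minimal stand-in for a DOM node and a head/tail visitor.
final class TraversalSketch {

    static final class SimpleNode {
        final String name;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name, SimpleNode... kids) {
            this.name = name;
            for (SimpleNode kid : kids) {
                children.add(kid);
            }
        }
    }

    interface Visitor {
        void head(SimpleNode node, int depth); // called before the children
        void tail(SimpleNode node, int depth); // called after the children
    }

    // Depth-first walk: head, then every child in document order, then tail.
    static void traverse(SimpleNode node, Visitor visitor, int depth) {
        visitor.head(node, depth);
        for (SimpleNode child : node.children) {
            traverse(child, visitor, depth + 1);
        }
        visitor.tail(node, depth);
    }

    // Collects "Entering"/"Exiting" lines and skips text nodes.
    static List<String> log(SimpleNode root) {
        List<String> lines = new ArrayList<>();
        traverse(root, new Visitor() {
            public void head(SimpleNode node, int depth) {
                if (!node.name.equalsIgnoreCase("#text")) {
                    lines.add("Entering tag: " + node.name);
                }
            }
            public void tail(SimpleNode node, int depth) {
                if (!node.name.equalsIgnoreCase("#text")) {
                    lines.add("Exiting tag: " + node.name);
                }
            }
        }, 0);
        return lines;
    }

    public static void main(String[] args) {
        // A cut-down version of the td cell: an image link followed by a div.
        SimpleNode td = new SimpleNode("td",
                new SimpleNode("a", new SimpleNode("img")),
                new SimpleNode("div", new SimpleNode("#text"), new SimpleNode("a")));
        log(td).forEach(System.out::println);
    }
}
```

In this sketch the div is reported between "Exiting tag: a" and "Exiting tag: td" as long as it is present in the tree being walked, so a missing div in the log points to the parsed document rather than to the visitor.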
Tested with Jsoup 1.8.3.
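If it still doesn't work, the div may be added by JavaScript after the page loads. You can post the Wikipedia URL in a comment below this answer and I'll take a look.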