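Jsoup HTML tree traversal

Date: 2016-03-31 15:46:38

Tags: java html web-scraping jsoup

I am trying to traverse the HTML tree of a Wikipedia page, but the traversal seems to skip certain blocks of HTML elements. Is there a way to prevent this omission?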

CODE

Document doc = Jsoup.connect(url).timeout(10000).userAgent(USER_AGENT).get();
// Get the first table with the specific class
Element tableWithDetails = doc.select("table[class=infobox geography vcard]").get(0);
// p(...) is presumably a print helper defined elsewhere (not shown)
tableWithDetails.traverse(new NodeVisitor() {
    public void head(Node node, int depth) {
        // Skip text nodes; log every other node on entry
        if (!node.nodeName().equalsIgnoreCase("#text")) {
            p("Entering tag: " + node.nodeName());
        }
    }
    public void tail(Node node, int depth) {
        if (!node.nodeName().equalsIgnoreCase("#text")) {
            p("Exiting tag: " + node.nodeName());
        }
    }
});

Wikipedia HTML CODE

<table class="infobox geography vcard" style="width:22em;width:23em"> 
 <tbody>
  <tr> 
    <th colspan="2" style="text-align:center;font-size:125%;font-weight:bold;font-size:1.25em; white-space:nowrap"><span class="fn org"><span class="wrap">Dresden</span></span></th> 
    </tr> 
    <tr> 
    <td colspan="2" style="text-align:center;padding:0.7em 0.8em"><a href="/wiki/File:Dresden_montage.JPG" class="image" title="Clockwise: Dresden at night, Dresden Frauenkirche, Schloss Pillnitz, Dresden Castle and Zwinger."><img alt="Clockwise: Dresden at night, Dresden Frauenkirche, Schloss Pillnitz, Dresden Castle and Zwinger." src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Dresden_montage.JPG/300px-Dresden_montage.JPG" width="300" height="390" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Dresden_montage.JPG/450px-Dresden_montage.JPG 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Dresden_montage.JPG/600px-Dresden_montage.JPG 2x" data-file-width="610" data-file-height="792"></a> 
  <div>
   Clockwise: Dresden at night, 
   <a href="/wiki/Dresden_Frauenkirche" title="Dresden Frauenkirche">Dresden Frauenkirche</a>, 
  <a href="/wiki/Schloss_Pillnitz" title="Schloss Pillnitz" class="mw-redirect">Schloss Pillnitz</a>, 
   <a href="/wiki/Dresden_Castle" title="Dresden Castle">Dresden Castle</a> and 
   <a href="/wiki/Zwinger_(Dresden)" title="Zwinger (Dresden)">Zwinger</a>.
  </div> </td> 
  </tr> 

Output

Entering tag: table
Entering tag: tbody
Entering tag: tr
Entering tag: th
Entering tag: span
Entering tag: span
Exiting tag: span
Exiting tag: span
Exiting tag: th
Exiting tag: tr
Entering tag: tr
Entering tag: td
Entering tag: a
Entering tag: img
Exiting tag: img
Exiting tag: a
Exiting tag: td
Exiting tag: tr

The div that should appear before the closing td and tr is omitted.

1 Answer:

Answer 0 (score: 0)

Try this CSS query instead:

table.infobox.geography.vcard
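
As a rough illustration of how this query differs from the attribute form used in the question: a class selector matches each class name independently, whereas `[class=...]` compares the whole attribute value as one string. The sketch below assumes jsoup is on the classpath; the class name `SelectorSketch`, the helper `count`, and the two inline markup strings are made up for this example and are not from the original post. It does not by itself explain the missing div.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorSketch {
    // Counts how many elements in the given HTML match the CSS query.
    static int count(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Hypothetical markup: the same three classes in two different orders.
        String sameOrder = "<table class=\"infobox geography vcard\"></table>";
        String otherOrder = "<table class=\"vcard infobox geography\"></table>";

        // Class selector: the order of class names does not matter.
        System.out.println(count(sameOrder, "table.infobox.geography.vcard"));
        System.out.println(count(otherOrder, "table.infobox.geography.vcard"));

        // Attribute selector: the full attribute string must be equal.
        System.out.println(count(sameOrder, "table[class=infobox geography vcard]"));
        System.out.println(count(otherOrder, "table[class=infobox geography vcard]"));
    }
}
```

With the class selector both tables are found; with the attribute selector only the table whose class attribute is written in exactly that order is found.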

Sample code

String html = "<table class=\"infobox geography vcard\" style=\"width:22em;width:23em\"> \n"
        + " <tbody>\n"
        + " <tr> \n"
        + " <th colspan=\"2\" style=\"text-align:center;font-size:125%;font-weight:bold;font-size:1.25em; white-space:nowrap\"><span class=\"fn org\"><span class=\"wrap\">Dresden</span></span></th> \n"
        + " </tr> \n"
        + " <tr> \n"
        + " <td colspan=\"2\" style=\"text-align:center;padding:0.7em 0.8em\"><a href=\"/wiki/File:Dresden_montage.JPG\" class=\"image\" title=\"Clockwise: Dresden at night, Dresden Frauenkirche, Schloss Pillnitz, Dresden Castle and Zwinger.\"><img alt=\"Clockwise: Dresden at night, Dresden Frauenkirche, Schloss Pillnitz, Dresden Castle and Zwinger.\" src=\"//upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Dresden_montage.JPG/300px-Dresden_montage.JPG\" width=\"300\" height=\"390\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Dresden_montage.JPG/450px-Dresden_montage.JPG 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Dresden_montage.JPG/600px-Dresden_montage.JPG 2x\" data-file-width=\"610\" data-file-height=\"792\"></a> \n"
        + " <div>\n" + " Clockwise: Dresden at night, \n"
        + " <a href=\"/wiki/Dresden_Frauenkirche\" title=\"Dresden Frauenkirche\">Dresden Frauenkirche</a>, \n"
        + " <a href=\"/wiki/Schloss_Pillnitz\" title=\"Schloss Pillnitz\" class=\"mw-redirect\">Schloss Pillnitz</a>, \n"
        + " <a href=\"/wiki/Dresden_Castle\" title=\"Dresden Castle\">Dresden Castle</a> and \n"
        + " <a href=\"/wiki/Zwinger_(Dresden)\" title=\"Zwinger (Dresden)\">Zwinger</a>.\n" + " </div> </td> \n" + " </tr>";

Document doc = Jsoup.parse(html);
Element tableWithDetails = doc.select("table.infobox.geography.vcard").get(0);
tableWithDetails.traverse(new NodeVisitor() {
    public void head(Node node, int depth) {
        if (!node.nodeName().equalsIgnoreCase("#text")) {
            p("Entering tag: " + node.nodeName());
        }
    }

    public void tail(Node node, int depth) {
        if (!node.nodeName().equalsIgnoreCase("#text")) {
            p("Exiting tag: " + node.nodeName());
        }
    }
});

OUTPUT (the arrows below are not part of the output)

Entering tag: table
Entering tag: tbody
Entering tag: tr
Entering tag: th
Entering tag: span
Entering tag: span
Exiting tag: span
Exiting tag: span
Exiting tag: th
Exiting tag: tr
Entering tag: tr
Entering tag: td
Entering tag: a
Entering tag: img
Exiting tag: img
Exiting tag: a
Entering tag: div    <--- 
Entering tag: a
Exiting tag: a
Entering tag: a
Exiting tag: a
Entering tag: a
Exiting tag: a
Entering tag: a
Exiting tag: a
Exiting tag: div  <--- 
Exiting tag: td
Exiting tag: tr
Exiting tag: tbody
Exiting tag: table

Jsoup 1.8.3
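
To make the shape of the log above easier to read: `head` fires when a node is first reached and `tail` fires after all of its descendants have been visited, which is why every "Entering" line has a matching "Exiting" line nested around its children. The following is a minimal stand-alone sketch of that idea using only the JDK. `SimpleNode`, `Visitor`, and `TraversalSketch` are hypothetical names invented here, and the recursion is only an illustration, not jsoup's own implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TraversalSketch {
    // Hypothetical stand-in for a DOM node: a name plus its child nodes.
    static class SimpleNode {
        final String name;
        final List<SimpleNode> children;

        SimpleNode(String name, SimpleNode... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // Hypothetical visitor with the same two-callback shape as the sample code.
    interface Visitor {
        void head(SimpleNode node, int depth);
        void tail(SimpleNode node, int depth);
    }

    // Depth-first walk: head before the children, tail after them.
    static void traverse(SimpleNode node, int depth, Visitor visitor) {
        visitor.head(node, depth);
        for (SimpleNode child : node.children) {
            traverse(child, depth + 1, visitor);
        }
        visitor.tail(node, depth);
    }

    // Collects the same "Entering"/"Exiting" lines the sample code prints.
    static List<String> log(SimpleNode root) {
        final List<String> lines = new ArrayList<>();
        traverse(root, 0, new Visitor() {
            public void head(SimpleNode node, int depth) {
                lines.add("Entering tag: " + node.name);
            }

            public void tail(SimpleNode node, int depth) {
                lines.add("Exiting tag: " + node.name);
            }
        });
        return lines;
    }

    public static void main(String[] args) {
        // A td holding an image link followed by a div with one link.
        SimpleNode td = new SimpleNode("td",
                new SimpleNode("a", new SimpleNode("img")),
                new SimpleNode("div", new SimpleNode("a")));
        for (String line : log(td)) {
            System.out.println(line);
        }
    }
}
```

If a child such as the div is absent from the parsed tree, the walk simply never reaches it, so a missing pair of lines points at the parsed document rather than at the visitor.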

If it still doesn't work, the div may have been added by JavaScript. You can post the Wikipedia URL in a comment below this answer and I'll take a look.
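
One way to check the JavaScript possibility is to ask whether the div exists at all in the document jsoup parsed, since jsoup does not execute scripts. The sketch below is only a suggestion under that assumption; `DivCheckSketch` and `infoboxHasDiv` are hypothetical names, and the inline HTML strings stand in for a fetched page so the example runs offline. In practice you would pass the `Document` returned by `Jsoup.connect(url)...get()`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DivCheckSketch {
    // Returns true if the first matching infobox table contains at least one div.
    static boolean infoboxHasDiv(Document doc) {
        Element table = doc.select("table.infobox.geography.vcard").first();
        return table != null && !table.select("div").isEmpty();
    }

    public static void main(String[] args) {
        // Hypothetical inline pages: one with the caption div, one without.
        Document withDiv = Jsoup.parse(
                "<table class=\"infobox geography vcard\"><tr><td><div>caption</div></td></tr></table>");
        Document withoutDiv = Jsoup.parse(
                "<table class=\"infobox geography vcard\"><tr><td>caption</td></tr></table>");

        System.out.println(infoboxHasDiv(withDiv));
        System.out.println(infoboxHasDiv(withoutDiv));
    }
}
```

If the check returns false for the fetched page while the browser shows the div, the element is missing from the HTML that jsoup received, and changing the visitor will not bring it back.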