How do I get the `td` text under a specific `th` in a table with Jsoup?

Asked: 2016-06-28 11:54:46

Tags: jsoup

Here is the table I am extracting from:

<table class="infobox vevent" style="width:22em">
<caption class="summary">Adobe Shockwave Player</caption>
<tr> 
 <td colspan="2" style="text-align:center"><a href="/wiki/File:Adobe_Shockwave_Player_logo.png" class="image"><img alt="Adobe Shockwave Player logo.png" src="//upload.wikimedia.org/wikipedia/en/thumb/8/8e/Adobe_Shockwave_Player_logo.png/64px-Adobe_Shockwave_Player_logo.png" width="64" height="64" srcset="//upload.wikimedia.org/wikipedia/en/thumb/8/8e/Adobe_Shockwave_Player_logo.png/96px-Adobe_Shockwave_Player_logo.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/8/8e/Adobe_Shockwave_Player_logo.png/128px-Adobe_Shockwave_Player_logo.png 2x" data-file-width="165" data-file-height="165"></a></td> 
</tr>
<tr> 
 <th scope="row" style="white-space: nowrap;"><a href="/wiki/Software_developer" title="Software developer">Original author(s)</a></th> 
 <td><a href="/wiki/Macromedia" title="Macromedia">Macromedia</a></td> 
</tr>
<tr> 
 <th scope="row" style="white-space: nowrap;"><a href="/wiki/Software_developer" title="Software developer">Developer(s)</a></th> 
 <td><a href="/wiki/Adobe_Systems" title="Adobe Systems">Adobe Systems</a></td> 
</tr>
<tr> 
 <th scope="row" style="white-space: nowrap;"><a href="/wiki/Software_release_life_cycle" title="Software release life cycle">Stable release</a></th> 
 <td>12.2.4.194 / 19&nbsp;February 2016<span class="noprint">; 4 months ago</span><span style="display:none">&nbsp;(<span class="bday dtstart published updated">2016-02-19</span>)</span><sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[1]</a></sup></td> 
</tr>
<tr> 
 <th scope="row" style="white-space: nowrap;"><a href="/wiki/Operating_system" title="Operating system">Operating system</a></th> 
 <td><a href="/wiki/Microsoft_Windows" title="Microsoft Windows">Microsoft Windows</a>, <a href="/wiki/Mac_OS_9" title="Mac OS 9">Mac OS 9</a>, <a href="/wiki/Mac_OS_X" class="mw-redirect" title="Mac OS X">Mac OS X</a> (Universal)</td> 
</tr>
<tr> 
 <th scope="row" style="white-space: nowrap;"><a href="/wiki/Computing_platform" title="Computing platform">Platform</a></th> 
 <td><a href="/wiki/Web_browsers" class="mw-redirect" title="Web browsers">Web browsers</a></td> 
</tr>
<tr> 
 <th scope="row" style="white-space: nowrap;"><a href="/wiki/List_of_software_categories" title="List of software categories">Type</a></th> 
 <td>Multimedia Player / <a href="/wiki/MIME" title="MIME">MIME</a> type: application/x-director</td> 
</tr>
<tr> 
 <th scope="row" style="white-space: nowrap;"><a href="/wiki/Software_license" title="Software license">License</a></th> 
 <td><a href="/wiki/Proprietary_software" title="Proprietary software">Proprietary</a><sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></td> 
</tr>
<tr> 
 <th scope="row" style="white-space: nowrap;">Website</th> 
 <td><span class="url"><a rel="nofollow" class="external text" href="http://www.adobe.com/products/shockwaveplayer/">www<wbr>.adobe<wbr>.com<wbr>/products<wbr>/shockwaveplayer<wbr>/</a></span></td> 
</tr>
</table>

I want:

1. The `td` text "12.2.4.194" under the header "Stable release".

2. The `td` text "Microsoft Windows" under the header "Operating system".

I am stuck with the following code:

Document doc = Jsoup.connect("url").get();
for (Element table : doc.select("table.infobox")) {
    String strName = table.getElementsByTag("caption").text();
    if (strName.toLowerCase().contains("shockwave player")) {
        Elements trow = table.select("tr");
        System.out.println(trow);
    }
}

1 Answer:

Answer 0 (score: 1)

Try this CSS query:

table.infobox tr:has(a:containsOwn(Stable release))   > td,
table.infobox tr:has(a:containsOwn(Operating system)) > td


Sample code:

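// Requires: import org.jsoup.nodes.Element;
// Returns the own text of the first td in the row whose header link contains headerText.
public static String getTDtext(Element table, String headerText) {
    Element td = table.select("tr:has(a:containsOwn(" + headerText + ")) > td").first();

    if (td == null) {
        throw new RuntimeException("Unable to find text for " + headerText);
    } else {
        // ownText() returns only the td's direct text; text inside child elements (e.g. links) is excluded.
        return td.ownText();
    }
}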
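For the "Stable release" row, the cell's direct text also carries the date after the slash, so `ownText()` returns more than the version number. If only the version is wanted, one option is to cut the string at the first `/`. The helper below is a minimal sketch in plain Java; the class name `CellText` and the method `beforeSlash` are made up for illustration, and the "version / date" layout is an assumption based on the row shown in the question.

```java
final class CellText {

    /** Returns the part of the cell text before the first '/', trimmed; the whole text if there is no '/'. */
    static String beforeSlash(String cellText) {
        int slash = cellText.indexOf('/');
        String head = (slash >= 0) ? cellText.substring(0, slash) : cellText;
        return head.trim();
    }

    public static void main(String[] args) {
        // Sample input modelled on the "Stable release" cell from the question.
        System.out.println(beforeSlash("12.2.4.194 / 19 February 2016")); // prints 12.2.4.194
    }
}
```

With the method above, this could be used as `CellText.beforeSlash(getTDtext(table, "Stable release"))`.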

Discussion:

tr                           /* Select tr elements ...                       */
:has(                        /* ... that have, as a descendant, ...          */
   a                         /* ... an anchor element ...                    */
   :containsOwn(headerText)  /* ... whose own text contains headerText ...   */
)
> td                         /* ... then select their direct td children     */
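To try the selector without fetching the page, here is a sketch that parses a shortened copy of two rows from the question's table. The class `InfoboxDemo`, the helpers `cellUnder` and `firstLinkText`, and the trimmed HTML string are illustrative assumptions, and the jsoup library must be on the classpath. Because "Microsoft Windows" sits inside a link rather than directly in the `td`, the sketch reads the first link's text for that row instead of the cell's own text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class InfoboxDemo {

    // Shortened copy of two rows from the question's table (attributes trimmed for brevity).
    static final String HTML =
          "<table class=\"infobox vevent\">"
        + "<caption class=\"summary\">Adobe Shockwave Player</caption>"
        + "<tr><th scope=\"row\"><a href=\"/wiki/Software_release_life_cycle\">Stable release</a></th>"
        + "<td>12.2.4.194 / 19 February 2016<span class=\"noprint\">; 4 months ago</span></td></tr>"
        + "<tr><th scope=\"row\"><a href=\"/wiki/Operating_system\">Operating system</a></th>"
        + "<td><a href=\"/wiki/Microsoft_Windows\">Microsoft Windows</a>, "
        + "<a href=\"/wiki/Mac_OS_9\">Mac OS 9</a> (Universal)</td></tr>"
        + "</table>";

    /** Returns the first td of the row whose header link contains headerText, or null if none matches. */
    static Element cellUnder(Element table, String headerText) {
        return table.select("tr:has(a:containsOwn(" + headerText + ")) > td").first();
    }

    /** Returns the text of the first link inside the cell, falling back to the cell's full text. */
    static String firstLinkText(Element td) {
        Element link = td.select("a").first();
        return (link == null) ? td.text() : link.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Element table = doc.select("table.infobox").first();

        Element stable = cellUnder(table, "Stable release");
        Element os = cellUnder(table, "Operating system");

        // Direct text of the cell, without the nested span.
        System.out.println(stable.ownText());
        // Text of the first link in the cell.
        System.out.println(firstLinkText(os));
    }
}
```

Note that this selector relies on the header being wrapped in a link; a row such as "Website", whose `th` holds plain text, would need a selector on the `th` itself.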
