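I need help parsing this HTML with JSoup. I'm trying to get the data value from every column of the table. I've been going through the JSoup documentation trying to work out exactly what I need to do, but I'm still not sure. The site appears to use a mix of CSS and inline formatting; most of the inline formatting could be moved into CSS, which would reduce the page size.
Here is a small part of the HTML file (the real file is almost 5 MB).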
<html>
<head>
</head>
<body>
<table>
<tr>
<td> </td>
</tr>
<tr>
<td> </td>
</tr>
<tr>
<td> </td>
</tr>
<tr>
<td>
<div id="plyrRankings" style="overflow: scroll; overflow-x: hidden;">
<table id="u868top" width="868" bgcolor="#C8C8C8" cellspacing="0" cellpadding="0" border="0">
<tr>
<td class="legend titlesmall" bgcolor="#000000" align="left" height="60">#</td>
</tr>
</table>
<table id="u868" width="868" bgcolor="#C8C8C8" cellspacing="0" cellpadding="0" border="0">
<caption style="display:none">
Live ATP Ranking
</caption>
<thead>
<tr class="legend" bgcolor="#000000">
<td colspan="14" height="4"></td>
</tr>
<tr>
<td colspan="14" height="1"></td>
</tr>
<tr class="tbhead">
<td><b>#</b></td>
<td><b>CH</b></td>
<td><b>Player Name</b></td>
<td><b>Age</b></td>
<td><b>Ctry</b></td>
<td class="title" align="left" colspan="1" height="30" width="50" onMouseOver="this.className='title2';this.style.cursor='Pointer';" onMouseOut="this.className='title';this.style.cursor='Default'" onclick="sortByPosition();underlineHeaderColumn(5);"><b>Pts</b></td>
<td class="title" align="center" colspan="2" height="30" width="30" onMouseOver="this.className='title2';this.style.cursor='Pointer';" onMouseOut="this.className='title';this.style.cursor='Default'" onclick="sortByColumn(3);underlineHeaderColumn(6);"><b>+/-</b></td>
<td class="title hdcol" align="center" colspan="1" height="30" width="320" onMouseOver="this.className='title2';this.style.cursor='Pointer';" onMouseOut="this.className='title';this.style.cursor='Default'" onclick="sortByColumn(4);underlineHeaderColumn(7);"><b>Current Tournament</b></td>
<td class="title hdcol" align="center" height="30" width="320"><b>Previous Tournaments</b></td>
<td class="title shcol" align="center" height="30" width="320"><b>Current Tournament</b></td>
<td><b>Next Pts</b></td>
<td><b>Max Pts</b></td>
</tr>
<tr class="tbhead">
<td height="1" width="400" colspan="3"></td>
<td height="1" align="right" width="120" colspan="11"></td>
</tr>
<tr>
<td></td>
</tr>
</thead>
<tbody>
<tr bgColor="white" class="ESP">
<td width=20 height=30> 1 </td>
<td width=20><b class="smalltxt"> </b><b class="chigh"> CH </b><b class="smalltxt"> </b></td>
<td>
<div class="spr esp"></div>
</td>
<td width=150>Rafael Nadal</td>
<td width=50>31<span style="font-size:66%">.6</span></td>
<td width=80>ESP<span style="font-size:66%">1</span></td>
<td width=50>9580</td>
<td align="center">-</td>
<td align="center"><b class="smallred">-1020</b></td>
<td class="hdcol" align="center" width=320>Australian Open R16<br> (R32
<a href="" onclick="playVideo('6i9o76bE4vM' );return false;"> <img width=20 src="/youtube-logo-play-icon.png" style="vertical-align:middle;margin-top:-2px";></a>)</td>
<td class="hdcol" align="center" width=320>-</td>
<td class="shcol" align="center" width=320>Australian Open R16<br> (R32
<a href="" onclick="playVideo('6i9o76bE4vM' );return false;"> <img width=20 src="/youtube-logo-play-icon.png" style="vertical-align:middle;margin-top:-2px";></a>)</td>
<td width=50>9760</td>
<td width=50>11400</td>
</tr>
<tr>
<td colspan=14 height=1></td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</table>
</body>
</html>
Here is my Parse class:
public static class Parse {
public static ArrayList<Player> playerList(Document doc) {
ArrayList<Player> players = new ArrayList<>();
try {
Elements trs = doc.select("tbody tr");
for (Element tr : trs) {
Elements tds = tr.getElementsByTag("td");
Element td = tds.first();
System.out.println("Blog: " + td.text());
}
} catch (Exception e) {
e.printStackTrace();
}
return players;
}
}
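Update: I have updated the source to show the structure of the HTML more accurately. I had assumed that tbody would appear directly inside the table element. I guess I was wrong, sorry.
Answer 0 (score: 0)
I had some trouble parsing the snippet you provided because the table element tags were missing, but once I added them I was able to get the text of each column with the following logic: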
public static void main(String args[]) {
String html = "<html> <head></head> <body> <table>\n" +
"<tbody>\n" +
"<tr bgColor=\"white\" class=\"ESP\">\n" +
" <td width=20 height=30> 1 </td>\n" +
" <td width=20><b class=\"smalltxt\"> </b><b class=\"chigh\"> CH </b><b class=\"smalltxt\"> </b></td> <td><div class=\"spr esp\"></div></td> \n" +
" <td width=150>Rafael Nadal</td> \n" +
" <td width=50>31<span style=\"font-size:66%\">.6</span></td> \n" +
" <td width=80>ESP<span style=\"font-size:66%\">1</span></td> \n" +
" <td width=50>9580</td> <td align=\"center\">-</td> \n" +
" <td align=\"center\"><b class=\"smallred\">-1020</b></td> \n" +
" <td class=\"hdcol\" align=\"center\" width=320>Australian Open R16<br> (R32 <a href=\"\" onclick=\"playVideo('6i9o76bE4vM' );return false;\" > <img width=20 src=\"/youtube-logo-play-icon.png\" style=\"vertical-align:middle;margin-top:-2px\";></a>)</td> \n" +
" <td class=\"hdcol\" align=\"center\" width=320>-</td> <td class=\"shcol\" align=\"center\" width=320>Australian Open R16<br> (R32 <a href=\"\" onclick=\"playVideo('6i9o76bE4vM' );return false;\" > <img width=20 src=\"/youtube-logo-play-icon.png\" style=\"vertical-align:middle;margin-top:-2px\";></a>)</td> \n" +
" <td width=50>9760</td> <td width=50>11400</td> \n" +
"</tr>\n" +
"</tbody>\n" +
"</table>\n" +
"</body>\n" +
"</html>";
Document document = Jsoup.parse(html);
Elements data = document.select("body > table > tbody > tr > td");
for (Element value : data) {
System.out.println(value.text());
}
}
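In the updated HTML from the question, the ranking table (id "u868") is nested inside an outer layout table, so a selector anchored at `body > table` would match only the outer cells. A minimal sketch of one way to target the nested table directly is shown here. The class name `RankingRows` and the method `extractRows` are hypothetical names chosen for illustration, and the sketch assumes jsoup is on the classpath and that spacer rows always contain a single cell, as they do in the posted snippet.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class RankingRows {

    /** Returns one list of cell texts per data row of the table with id "u868". */
    public static List<List<String>> extractRows(Document doc) {
        List<List<String>> rows = new ArrayList<>();
        // The id selector skips the outer layout table and the "u868top" table.
        for (Element tr : doc.select("table#u868 > tbody > tr")) {
            // Direct children of the row, i.e. its cells.
            Elements tds = tr.children();
            // Assumption: spacer rows hold a single colspan cell, so skip them.
            if (tds.size() < 2) {
                continue;
            }
            List<String> cells = new ArrayList<>();
            for (Element td : tds) {
                // text() joins nested text ("31" + ".6") and trims whitespace;
                // empty cells are kept as "" so column indices stay stable.
                cells.add(td.text());
            }
            rows.add(cells);
        }
        return rows;
    }
}
```

Calling `RankingRows.extractRows(Jsoup.parse(html))` on the full page should give one list per player row. Keeping empty cells in the list is a deliberate choice here, so that a given column always sits at the same index.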
Answer 1 (score: 0)
This code will successfully read the table contents from the HTML provided in your question:
String html = "your html";
Document doc = Jsoup.parse(html);
try {
// select the table
Elements table = doc.select("table");
// select all rows in the table
Elements trs = table.select("tr");
for (Element tr : trs) {
// select all cells in this row
Elements tds = tr.getElementsByTag("td");
for (Element td : tds) {
// print out the cell content
System.out.println(td.text());
}
}
} catch (Exception e) {
e.printStackTrace();
}
Given the HTML provided in your question, this code will print:
1
CH
Rafael Nadal
31.6
ESP1
9580
-
-1020
Australian Open R16 (R32 )
-
Australian Open R16 (R32 )
9760
11400
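The `playerList` method in the question is meant to return `Player` objects, but the `Player` type is never shown. Below is a rough sketch of how the cell texts of one row might be mapped onto such an object. The `Player` class here, its fields, and the `PlayerMapper.fromCells` helper are all hypothetical. The column positions (0 for rank, 3 for name, 4 for age, 6 for points) are an assumption based on the 14 cells of the sample row in the question's HTML, counting the empty flag cell.

```java
import java.util.List;

public class PlayerMapper {

    /** Hypothetical stand-in for the question's undefined Player type. */
    public static final class Player {
        public final int rank;
        public final String name;
        public final double age;
        public final int points;

        public Player(int rank, String name, double age, int points) {
            this.rank = rank;
            this.name = name;
            this.age = age;
            this.points = points;
        }
    }

    /**
     * Builds a Player from the cell texts of one data row.
     * Assumed layout: index 0 = rank, 3 = name, 4 = age, 6 = points.
     */
    public static Player fromCells(List<String> cells) {
        if (cells.size() < 7) {
            throw new IllegalArgumentException(
                    "expected at least 7 cells, got " + cells.size());
        }
        return new Player(
                Integer.parseInt(cells.get(0).trim()),
                cells.get(3).trim(),
                Double.parseDouble(cells.get(4).trim()),
                Integer.parseInt(cells.get(6).trim()));
    }
}
```

If the real page has rows with missing or non-numeric values, the `parseInt` and `parseDouble` calls will throw `NumberFormatException`, so a caller may want to catch that per row instead of wrapping the whole loop.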