Need help parsing with JSoup

Time: 2018-01-21 04:10:48

Tags: android jsoup

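I need help parsing this HTML with JSoup. I am trying to get the data values from each column in the table. I have been going through the JSoup documentation trying to work out exactly what I need to do, but I am still not sure. The site appears to use a mix of CSS and inline formatting; most of the inline formatting could be moved into CSS, which would reduce the page size.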

Here is a small part of the HTML file (the actual file is almost 5 MB).



<html>

<head>
</head>

<body>
  <table>
    <tr>
      <td>&nbsp;</td>
    </tr>
    <tr>
      <td>&nbsp;</td>
    </tr>
    <tr>
      <td>&nbsp;</td>
    </tr>
    <tr>
      <td>
        <div id="plyrRankings" style="overflow: scroll; overflow-x: hidden;">
          <table id="u868top" width="868" bgcolor="#C8C8C8" cellspacing="0" cellpadding="0" border="0">
            <tr>
              <td class="legend titlesmall" bgcolor="#000000" align="left" height="60">#</td>
            </tr>
          </table>
          <table id="u868" width="868" bgcolor="#C8C8C8" cellspacing="0" cellpadding="0" border="0">
            <caption style="display:none">
              Live ATP Ranking
            </caption>
            <thead>
              <tr class="legend" bgcolor="#000000">
                <td colspan="14" height="4"></td>
              </tr>
              <tr>
                <td colspan="14" height="1"></td>
              </tr>
              <tr class="tbhead">
                <td><b>#</b></td>
                <td><b>CH</b></td>
                <td><b>Player Name</b></td>
                <td><b>Age</b></td>
                <td><b>Ctry</b></td>
                <td class="title" align="left" colspan="1" height="30" width="50" onMouseOver="this.className='title2';this.style.cursor='Pointer';" onMouseOut="this.className='title';this.style.cursor='Default'" onclick="sortByPosition();underlineHeaderColumn(5);"><b>Pts</b></td>
                <td class="title" align="center" colspan="2" height="30" width="30" onMouseOver="this.className='title2';this.style.cursor='Pointer';" onMouseOut="this.className='title';this.style.cursor='Default'" onclick="sortByColumn(3);underlineHeaderColumn(6);"><b>+/-</b></td>
                <td class="title hdcol" align="center" colspan="1" height="30" width="320" onMouseOver="this.className='title2';this.style.cursor='Pointer';" onMouseOut="this.className='title';this.style.cursor='Default'" onclick="sortByColumn(4);underlineHeaderColumn(7);"><b>Current Tournament</b></td>
                <td class="title hdcol" align="center" height="30" width="320"><b>Previous Tournaments</b></td>
                <td class="title shcol" align="center" height="30" width="320"><b>Current Tournament</b></td>
                <td><b>Next Pts</b></td>
                <td><b>Max Pts</b></td>
              </tr>
              <tr class="tbhead">
                <td height="1" width="400" colspan="3"></td>
                <td height="1" align="right" width="120" colspan="11"></td>
              </tr>
              <tr>
                <td></td>
              </tr>
            </thead>
            <tbody>
              <tr bgColor="white" class="ESP">
                <td width=20 height=30>&nbsp;1&nbsp;</td>
                <td width=20><b class="smalltxt">&nbsp;&nbsp;</b><b class="chigh">&nbsp;CH&nbsp;</b><b class="smalltxt">&nbsp;&nbsp;&nbsp;</b></td>
                <td>
                  <div class="spr esp"></div>
                </td>
                <td width=150>Rafael Nadal</td>
                <td width=50>31<span style="font-size:66%">.6</span></td>
                <td width=80>ESP<span style="font-size:66%">1</span></td>
                <td width=50>9580</td>
                <td align="center">-</td>
                <td align="center"><b class="smallred">-1020</b></td>
                <td class="hdcol" align="center" width=320>Australian Open R16<br> (R32&nbsp;
                  <a href="" onclick="playVideo('6i9o76bE4vM' );return false;">&nbsp;<img width=20 src="/youtube-logo-play-icon.png" style="vertical-align:middle;margin-top:-2px";></a>)</td>
                <td class="hdcol" align="center" width=320>-</td>
                <td class="shcol" align="center" width=320>Australian Open R16<br> (R32&nbsp;
                  <a href="" onclick="playVideo('6i9o76bE4vM' );return false;">&nbsp;<img width=20 src="/youtube-logo-play-icon.png" style="vertical-align:middle;margin-top:-2px";></a>)</td>
                <td width=50>9760</td>
                <td width=50>11400</td>
              </tr>
              <tr>
                <td colspan=14 height=1></td>
              </tr>
            </tbody>
          </table>
        </div>
      </td>
    </tr>
  </table>
</body>

</html>

Here is my Parse class:

public static class Parse {

    public static ArrayList<Player> playerList(Document doc) {

        ArrayList<Player> players = new ArrayList<>();

        try {
            Elements trs = doc.select("tbody tr");                

            for (Element tr : trs) {
                Elements tds = tr.getElementsByTag("td");
                Element td = tds.first();
                System.out.println("Blog: " + td.text());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

        return players;
    }
}

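Update: I have updated the source to show the structure of the HTML more accurately. I had assumed that tbody would be present in the elements. I guess I was wrong, sorry.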

2 answers:

Answer 0 (score: 0)

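I had some trouble parsing the snippet you provided because the table element tags were missing, but once I added them I was able to get the text in each column with the following logic: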

public static void main(String[] args) {

    String html = "<html> <head></head> <body> <table>\n" +
            "<tbody>\n" +
            "<tr bgColor=\"white\" class=\"ESP\">\n" +
            "    <td width=20 height=30>&nbsp;1&nbsp;</td>\n" +
            "    <td width=20><b class=\"smalltxt\">&nbsp;&nbsp;</b><b class=\"chigh\">&nbsp;CH&nbsp;</b><b class=\"smalltxt\">&nbsp;&nbsp;&nbsp;</b></td> <td><div class=\"spr esp\"></div></td> \n" +
            "    <td width=150>Rafael Nadal</td> \n" +
            "    <td width=50>31<span style=\"font-size:66%\">.6</span></td> \n" +
            "    <td width=80>ESP<span style=\"font-size:66%\">1</span></td> \n" +
            "    <td width=50>9580</td> <td align=\"center\">-</td> \n" +
            "    <td align=\"center\"><b class=\"smallred\">-1020</b></td> \n" +
            "    <td class=\"hdcol\" align=\"center\" width=320>Australian Open R16<br> (R32&nbsp;<a href=\"\" onclick=\"playVideo('6i9o76bE4vM' );return false;\" >&nbsp;<img width=20 src=\"/youtube-logo-play-icon.png\" style=\"vertical-align:middle;margin-top:-2px\";></a>)</td> \n" +
            "    <td class=\"hdcol\" align=\"center\" width=320>-</td> <td class=\"shcol\" align=\"center\" width=320>Australian Open R16<br> (R32&nbsp;<a href=\"\" onclick=\"playVideo('6i9o76bE4vM' );return false;\" >&nbsp;<img width=20 src=\"/youtube-logo-play-icon.png\" style=\"vertical-align:middle;margin-top:-2px\";></a>)</td> \n" +
            "    <td width=50>9760</td> <td width=50>11400</td> \n" +
            "</tr>\n" +
            "</tbody>\n" +
            "</table>\n" +
            "</body>\n" +
            "</html>";

    Document document = Jsoup.parse(html);

    Elements data = document.select("body > table > tbody > tr > td");

    for (Element value : data) {
        System.out.println(value.text());
    }
}
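
For the fuller page structure shown in the updated question, a child-combinator chain that starts at `body` cannot reach the ranking cells, because `table#u868` is nested inside a `div` within the outer layout table. Below is a minimal sketch rather than a solution tested against the live page. It assumes the table keeps the id `u868` and that spacer rows hold a single `colspan` cell while data rows hold several. The names `RankingTableSketch` and `dataRows` are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class RankingTableSketch {

    /** Returns the cleaned cell texts of each data row in table#u868. */
    static List<List<String>> dataRows(Document doc) {
        List<List<String>> rows = new ArrayList<>();
        // Select rows through the table's id so the outer layout table,
        // the u868top header table and the thead rows are all skipped.
        for (Element tr : doc.select("table#u868 > tbody > tr")) {
            // Assumption: spacer rows hold a single colspan cell.
            if (tr.children().size() < 2) {
                continue;
            }
            List<String> cells = new ArrayList<>();
            for (Element td : tr.children()) {
                // Turn non-breaking spaces (&nbsp;) into plain spaces, then trim.
                cells.add(td.text().replace('\u00a0', ' ').trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Cut-down stand-in for the page: a nested table with one data row
        // and one spacer row.
        String html = "<table><tr><td><div id=\"plyrRankings\">"
                + "<table id=\"u868\"><thead><tr><td>#</td><td>Player Name</td></tr></thead>"
                + "<tbody><tr><td>&nbsp;1&nbsp;</td><td>Rafael Nadal</td></tr>"
                + "<tr><td colspan=14 height=1></td></tr></tbody></table>"
                + "</div></td></tr></table>";
        for (List<String> row : dataRows(Jsoup.parse(html))) {
            System.out.println(row);
        }
    }
}
```

Iterating over `tr.children()` instead of `tr.getElementsByTag("td")` keeps the lookup to the row's own cells, which matters once tables are nested inside cells.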

Answer 1 (score: 0)

This code will successfully read the table contents from the HTML provided in your question:

String html = "your html";

Document doc = Jsoup.parse(html);

try {
    // select the table
    Elements table = doc.select("table");
    // select all rows in the table
    Elements trs = table.select("tr");

    for (Element tr : trs) {
        // select all cells in this row
        Elements tds = tr.getElementsByTag("td");
        for (Element td : tds) {
            // print out the cell content
            System.out.println(td.text());
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}

Given the HTML provided in your question, this code will print:

 1 
   CH    

Rafael Nadal
31.6
ESP1
9580
-
-1020
Australian Open R16 (R32  )
-
Australian Open R16 (R32  )
9760
11400
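
The padding around values such as ` 1 ` and `CH` in that output most likely comes from the `&nbsp;` entities in the source cells, and the question's `playerList` method never fills its list. The sketch below shows one way the printed cell texts could be cleaned and mapped to an object using only the standard library. The `Player` class here is a hypothetical stand-in, since the question does not show its own, and the column positions (0 for rank, 3 for name, 6 for points) are assumptions read off the single sample row.

```java
import java.util.Arrays;
import java.util.List;

final class PlayerRowSketch {

    /** Hypothetical stand-in for the question's Player class, which is not shown. */
    static final class Player {
        final int rank;
        final String name;
        final int points;

        Player(int rank, String name, int points) {
            this.rank = rank;
            this.name = name;
            this.points = points;
        }
    }

    /** Replaces non-breaking spaces (from &nbsp;) with plain spaces, then trims. */
    static String clean(String cellText) {
        return cellText.replace('\u00a0', ' ').trim();
    }

    /**
     * Builds a Player from one row's cell texts.
     * Returns null for spacer rows or rows whose numbers do not parse.
     */
    static Player fromCells(List<String> cells) {
        // Assumption: a data row has 14 cells, as in the sample row.
        if (cells.size() < 14) {
            return null;
        }
        try {
            int rank = Integer.parseInt(clean(cells.get(0)));
            String name = clean(cells.get(3));
            int points = Integer.parseInt(clean(cells.get(6)));
            return new Player(rank, name, points);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Cell texts modelled on the sample row; \u00a0 is the non-breaking space.
        List<String> row = Arrays.asList(
                "\u00a01\u00a0", "\u00a0CH\u00a0", "", "Rafael Nadal", "31.6", "ESP1",
                "9580", "-", "-1020", "Australian Open R16", "-",
                "Australian Open R16", "9760", "11400");
        Player p = fromCells(row);
        if (p != null) {
            System.out.println(p.rank + " " + p.name + " " + p.points);
        }
    }
}
```

Returning null keeps the sketch usable on older Android API levels; on a newer toolchain an `Optional<Player>` would be a reasonable alternative.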