Extracting unstructured data from an HTML table using jsoup

Time: 2013-11-05 05:06:51

Tags: java dom jsoup html-table

I have been trying to get this code to work with jsoup. The idea is to extract the movie schedule from this page:

http://www.blitzmegaplex.com/en/schedule_movie.php?id=MOV1970

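So far I can only extract the cinema names on their own, because they are marked with a specific class name ("separator2"). All the other cells use the class "separator".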

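I am trying to set up the following steps with a for loop. For each ROW in the TABLE:

  1. Get the cinema name.
  2. Skip the row directly below it (below the row from step #1).
  3. Get the second <td> with the class named "separator".
  4. Get the second <td> from every row below that (below the row from step #3), until reaching the next row that contains the class named "separator2".
  5. Repeat this process until all rows have been processed.

Can someone suggest what I should do? Or perhaps suggest a better approach?

Thanks.

My code so far: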

    public void getMovieSchedule(String movieUrl) throws IOException
    {
    
    
        //URL url = new URL(movieUrl);
        //Document doc = Jsoup.parse(url, 3000);
    
        //Element table = doc.select("table[div=scheduletbl]").first();
        //Iterator<Element> ite = table.select("tr").iterator();
        //ite.next(); // Skip the first row.
    
        // Actual content
        //print(ite.next().text());
    
        // *** CODE ABOVE DOES NOT WORK ***
    
        //final String urlSchedule = "http://www.blitzmegaplex.com/en/schedule_movie.php?id=MOV1970";
    
        Document doc = Jsoup.connect(movieUrl).get();
        Elements div = doc.select("div.panelbox");
    
        for(Element child : div)
        {
            Elements table = child.select("table");
            Elements row = table.select("tr"); // The actual content.
    
            for (Element a: row)
            {
                Elements cinemaName = a.select("td.separator2");
                print(cinemaName.text().toString());
            }
        }
    }
    

The HTML to extract from (some code omitted):

    <table width="95%" border="0" cellpadding="2" cellspacing="0" id="scheduletbl">
        <tbody>
    
        <tr>
        <td colspan="3" class="separator2"><strong>BLITZMEGAPLEX - PARIS VAN JAVA, BANDUNG</strong></td>
        </tr>
    
        <tr>
        <td colspan="3"><img src="../img/ico_rss_schedule_white.gif" width="16" height="16" hspace="5" align="left"><strong><a href="../rss/schedule.php" class="navlink">RSS- Paris van Java</a></strong></td>
        </tr>
        <tr>
        <td class="separator">&nbsp;</td>
        <td colspan="2" class="separator">TUESDAY, 05 NOVEMBER 2013</td>
        </tr>
        <tr>
        <td class="separator">&nbsp;</td>
        <td width="20%" class="separator" rel="2D">
        10:30&nbsp;&nbsp;&nbsp;
        </td>
        <td width="30%" class="separator">
        <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=10:30&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>
    
        </tr>
        <tr>
        <td class="separator">&nbsp;</td>
        <td width="20%" class="separator" rel="2D">
        13:15&nbsp;&nbsp;&nbsp;
        </td>
        <td width="30%" class="separator">
        <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=13:15&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>
    
        </tr>
        <tr>
        <td class="separator">&nbsp;</td>
        <td width="20%" class="separator" rel="2D">
        16:00&nbsp;&nbsp;&nbsp;
        </td>
        <td width="30%" class="separator">
        <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=16:00&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>
    
        </tr>
        <tr>
        <td class="separator">&nbsp;</td>
        <td width="20%" class="separator" rel="2D">
        18:45&nbsp;&nbsp;&nbsp;
        </td>
        <td width="30%" class="separator">
        <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=18:45&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>
    
        </tr>
        <tr>
        <td class="separator">&nbsp;</td>
        <td width="20%" class="separator" rel="2D">
        21:30&nbsp;&nbsp;&nbsp;
        </td>
        <td width="30%" class="separator">
        <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=21:30&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>
    
        </tr>
        <tr>
        <td colspan="3" class="separator2"><strong>BLITZMEGAPLEX - GRAND INDONESIA, JAKARTA</strong></td>
        </tr>
    
        <tr>
        <td colspan="3"><img src="../img/ico_rss_schedule_white.gif" width="16" height="16" hspace="5" align="left"><strong><a href="../rss/schedule.php" class="navlink">RSS- Grand Indonesia</a></strong></td>
        </tr>
        <tr>
        <td class="separator">&nbsp;</td>
        <td colspan="2" class="separator">TUESDAY, 05 NOVEMBER 2013</td>
        </tr>
        <tr>
        <td class="separator">&nbsp;</td>
        <td width="20%" class="separator" rel="2D">
        10:45&nbsp;&nbsp;&nbsp;
        </td>
        <td width="30%" class="separator">
        <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=10:45&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>
    
        </tr>
        <tr>
        <td class="separator">&nbsp;</td>
        <td width="20%" class="separator" rel="2D">
        13:30&nbsp;&nbsp;&nbsp;
        </td>
        <td width="30%" class="separator">
        <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=13:30&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>
    
        </tr>
        <tr>
        <td class="separator">&nbsp;</td>
        <td width="20%" class="separator" rel="2D">
        16:15&nbsp;&nbsp;&nbsp;
        </td>
        <td width="30%" class="separator">
        <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=16:15&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>
    
        </tr>
        <tr>
        <td class="separator">&nbsp;</td>
        <td width="20%" class="separator" rel="2D">
        19:00&nbsp;&nbsp;&nbsp;
        </td>
        <td width="30%" class="separator">
        <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=19:00&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>
    
        </tr>
        <tr>
        <td class="separator">&nbsp;</td>
        <td width="20%" class="separator" rel="2D">
        21:45&nbsp;&nbsp;&nbsp;
        </td>
        <td width="30%" class="separator">
        <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=21:45&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>
        </tr>
        ... MORE <tr> here ...
        </tbody></table>
    

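1 Answer:

Answer 0 (score: 0)

If I understand your question correctly, you only want to extract a few details from the table (the cinema name, the date, and the showtimes), but you are having trouble because most of the rows share the same class name.

Based on that, here is my solution: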

    // Select every cell of the schedule table (id "scheduletbl").
    Elements cells = doc.select("table#scheduletbl > tbody > tr > td");
    for (Element cell : cells) {
        if (cell.hasClass("separator2")) {
            System.out.println(cell.text()); // cinema name
        } else if ("2".equals(cell.attr("colspan"))) {
            System.out.println(cell.text()); // date
        } else if (cell.hasAttr("rel")) {
            System.out.println(cell.text()); // showtime
        }
    }

This prints:

    BLITZMEGAPLEX - PARIS VAN JAVA, BANDUNG
    TUESDAY, 05 NOVEMBER 2013
    10:30
    13:15
    16:00
    18:45
    21:30
    BLITZMEGAPLEX - GRAND INDONESIA, JAKARTA
    TUESDAY, 05 NOVEMBER 2013
    10:45
    13:30
    16:15
    19:00
    21:45
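
If the printed strings need to become real dates and times, something along these lines might help. It is only a sketch using `java.time` from the standard library; the class name `ScheduleTextParser` and its methods are hypothetical and not part of jsoup. It assumes the date cells always look like "TUESDAY, 05 NOVEMBER 2013" and that the time cells may still carry the non-breaking spaces (`&nbsp;`) seen in the HTML, which can depend on the jsoup version in use.

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

// Hypothetical helper, not part of jsoup or the original answer.
final class ScheduleTextParser {
    // Matches headings such as "TUESDAY, 05 NOVEMBER 2013", ignoring case.
    private static final DateTimeFormatter DATE_FORMAT = new DateTimeFormatterBuilder()
            .parseCaseInsensitive()
            .appendPattern("EEEE, dd MMMM yyyy")
            .toFormatter(Locale.ENGLISH);

    // String.trim() does not remove U+00A0, so map it to a plain space first.
    static String clean(String cellText) {
        return cellText.replace('\u00a0', ' ').trim();
    }

    static LocalDate parseDate(String cellText) {
        return LocalDate.parse(clean(cellText), DATE_FORMAT);
    }

    static LocalTime parseTime(String cellText) {
        // The default ISO format accepts "HH:mm" without seconds.
        return LocalTime.parse(clean(cellText));
    }

    public static void main(String[] args) {
        System.out.println(parseDate("TUESDAY, 05 NOVEMBER 2013")); // 2013-11-05
        System.out.println(parseTime("10:30\u00a0\u00a0\u00a0"));   // 10:30
    }
}
```

In the loop above, `parseDate(cell.text())` and `parseTime(cell.text())` could replace the plain `println` calls for the date and showtime branches.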

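Of course, this solution is tightly coupled to that particular table on that site, but as long as the markup does not change often and stays consistent across the site, it will work fine. You might consider creating a class to store all of this information.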
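
As a rough illustration of such a class, here is one possible shape. The names `CinemaSchedule` and `ScheduleCollector` are made up for this sketch, and it assumes the cells arrive in document order (cinema heading, then date, then showtimes) and that each cinema has a single date heading; if a cinema lists several days, only the last date would be kept.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value class: one cinema with its date heading and showtimes.
final class CinemaSchedule {
    final String cinemaName;
    String date = "";
    final List<String> times = new ArrayList<>();

    CinemaSchedule(String cinemaName) {
        this.cinemaName = cinemaName;
    }

    @Override
    public String toString() {
        return cinemaName + " | " + date + " | " + times;
    }
}

// Collects cells in document order; each cinema heading starts a new group.
final class ScheduleCollector {
    private final List<CinemaSchedule> schedules = new ArrayList<>();
    private CinemaSchedule current; // null until the first cinema heading

    void onCinema(String name) {
        current = new CinemaSchedule(name);
        schedules.add(current);
    }

    void onDate(String date) {
        if (current != null) current.date = date.trim();
    }

    void onTime(String time) {
        // Drop non-breaking spaces (U+00A0) that may trail the time text.
        if (current != null) current.times.add(time.replace('\u00a0', ' ').trim());
    }

    List<CinemaSchedule> result() {
        return schedules;
    }

    public static void main(String[] args) {
        ScheduleCollector c = new ScheduleCollector();
        c.onCinema("BLITZMEGAPLEX - PARIS VAN JAVA, BANDUNG");
        c.onDate("TUESDAY, 05 NOVEMBER 2013");
        c.onTime("10:30");
        c.onTime("13:15");
        c.onCinema("BLITZMEGAPLEX - GRAND INDONESIA, JAKARTA");
        c.onDate("TUESDAY, 05 NOVEMBER 2013");
        c.onTime("10:45");
        c.result().forEach(System.out::println);
    }
}
```

The three branches of the loop would then call `onCinema`, `onDate`, and `onTime` with `text()` of the matching cell instead of printing it, and `result()` would return one entry per cinema.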