Parsing HTML data with jsoup (open-source Java library)

Date: 2018-11-27 16:49:26

Tags: java html jsoup html-parsing

I am building an Android app that stores students' timetables, using the timetable shown on the online portal provided by the university.

Please see the screenshot, which shows the format in which the timetable is displayed.

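I am stuck because I cannot work out a pattern for extracting the data from the site: none of the columns or rows has an id attribute. Please see the HTML below. I would appreciate it if someone could suggest a good pattern. Keep in mind that I will be using only Java (Android) for this. All suggestions are welcome.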

<div class="portlet-body">

            <div class="table-responsive">
            <table class="table  table-light">
            <thead>
                <tr>
                    <th>&nbsp;</th>

                        <th style="text-align: center; color: black">MON</th>

                        <th style="text-align: center; color: black">TUE</th>

                        <th style="text-align: center; color: black">WED</th>

                        <th style="text-align: center; color: black">THU</th>

                        <th style="text-align: center; color: black">FRI</th>

                        <th style="text-align: center; color: black">SAT</th>

                        <th style="text-align: center; color: black">SUN</th>

                </tr>
            </thead>
            <tbody>

                    <tr>
                        <td class="label-success" style="color: #fff;">08:00 AM - 09:20 AM</td>


                                    <td style="background-color:#ddd;color:black;text-align: center;border-style: solid;">
                                        <div style="color:black;">Enterprise Application Development Lab(4)<br></div>
                                        <div style="color:gray;">SYED ARSLAN SAEED<br></div>
                                        <div style="color:black;"> [INST LAB-I, B-BLOCK]</div>

                                    </td>


                                <td>&nbsp;</td>

                                    <td style="background-color:#ddd;color:black;text-align: center;border-style: solid;">
                                        <div style="color:black;">Linear Algebra(3)<br></div>
                                        <div style="color:gray;">SHAHANA  RIZVI<br></div>
                                        <div style="color:black;"> [F5]</div>

                                    </td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>


                    </tr>


                    <tr>
                        <td class="label-success" style="color: #fff;">09:30 AM - 10:50 AM</td>

                                    <td style="background-color:#ddd;color:black;text-align: center;border-style: solid;">
                                        <div style="color:black;">Enterprise Application Development Lab(4)<br></div>
                                        <div style="color:gray;">SYED ARSLAN SAEED<br></div>
                                        <div style="color:black;"> [INST LAB-I, B-BLOCK]</div>

                                    </td>

                                <td>&nbsp;</td>

                                    <td style="background-color:#ddd;color:black;text-align: center;border-style: solid;">
                                        <div style="color:black;">Enterprise Application Development(3)<br></div>
                                        <div style="color:gray;">ASAD  MAHMOOD<br></div>
                                        <div style="color:black;"> [F4]</div>

                                    </td>

                                    <td style="background-color:#ddd;color:black;text-align: center;border-style: solid;">
                                        <div style="color:black;">Enterprise Application Development(3)<br></div>
                                        <div style="color:gray;">ASAD  MAHMOOD<br></div>
                                        <div style="color:black;"> [B9]</div>

                                    </td>

                                    <td style="background-color:#ddd;color:black;text-align: center;border-style: solid;">
                                        <div style="color:black;">Linear Algebra(3)<br></div>
                                        <div style="color:gray;">SHAHANA  RIZVI<br></div>
                                        <div style="color:black;"> [E5]</div>

                                    </td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>


                    </tr>


                    <tr>
                        <td class="label-success" style="color: #fff;">11:00 AM - 12:20 PM</td>

                                    <td style="background-color:#ddd;color:black;text-align: center;border-style: solid;">
                                        <div style="color:black;">Principles of Accounting-I(3)<br></div>
                                        <div style="color:gray;">NOUSHEEN TARIQ BHUTTA<br></div>
                                        <div style="color:black;"> [F6]</div>

                                    </td>

                                <td>&nbsp;</td>

                                    <td style="background-color:#ddd;color:black;text-align: center;border-style: solid;">
                                        <div style="color:black;">Principles of Accounting-I(3)<br></div>
                                        <div style="color:gray;">NOUSHEEN TARIQ BHUTTA<br></div>
                                        <div style="color:black;"> [B8]</div>

                                    </td>

                                <td>&nbsp;</td>

                                    <td style="background-color:#ddd;color:black;text-align: center;border-style: solid;">
                                        <div style="color:black;">Mobile Application Development(1)<br></div>
                                        <div style="color:gray;">ANSAR  JAVED<br></div>
                                        <div style="color:black;"> [B2]</div>

                                    </td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>


                    </tr>


                    <tr>
                        <td class="label-success" style="color: #fff;">12:30 PM - 01:50 PM</td>

                                <td>&nbsp;</td>

                                    <td style="background-color:#ddd;color:black;text-align: center;border-style: solid;">
                                        <div style="color:black;">Mobile Application Development(1)<br></div>
                                        <div style="color:gray;">ANSAR  JAVED<br></div>
                                        <div style="color:black;"> [E5]</div>

                                    </td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>


                    </tr>


                    <tr>
                        <td class="label-success" style="color: #fff;">02:00 PM - 03:20 PM</td>

                                    <td style="background-color:#ddd;color:black;text-align: center;border-style: solid;">
                                        <div style="color:black;">Artificial Intelligence(2)<br></div>
                                        <div style="color:gray;">AAMER  NADEEM<br></div>
                                        <div style="color:black;"> [E4]</div>

                                    </td>

                                <td>&nbsp;</td>


                                <td>&nbsp;</td>


                                <td>&nbsp;</td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>


                    </tr>


                    <tr>
                        <td class="label-success" style="color: #fff;">03:30 PM - 04:50 PM</td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>

                                    <td style="background-color:#ddd;color:black;text-align: center;border-style: solid;">
                                        <div style="color:black;">Artificial Intelligence(2)<br></div>
                                        <div style="color:gray;">AAMER  NADEEM<br></div>
                                        <div style="color:black;"> [B5]</div>

                                    </td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>

                                <td>&nbsp;</td>


                    </tr>


            </tbody>
        </table>
        </div>



</div>

1 Answer:

Answer 0 (score: 0)

Use jsoup:

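import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// url is the address of the timetable page (not given in the question)
Document doc = Jsoup.connect(url).get();
Elements tableElements = doc.select("table");
Elements rows = tableElements.select("tr");
// start at 1 to skip row 0, the header row, which contains th cells and no td cells
for (int i = 1; i < rows.size(); i++) {
    Elements cols = rows.get(i).select("td");
    // print the text of every cell in this row
    for (int j = 0; j < cols.size(); j++) {
        System.out.println(cols.get(j).text());
    }
}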
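
The loop above prints every cell but does not say which day a cell belongs to. Since the cells have no id, one possible pattern is to rely on position: in the HTML from the question each body row has one time cell followed by seven day cells, in the same order as the header. The sketch below assumes that layout holds (no rowspan or colspan) and that a filled cell always contains three div elements (course, instructor, room). The class name TimetableParser, the Slot type, and its field names are made up for illustration and are not part of jsoup or the portal.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class TimetableParser {

    /** One scheduled class. Hypothetical holder type, named for illustration only. */
    static final class Slot {
        public final String day;
        public final String time;
        public final String course;
        public final String instructor;
        public final String room;

        Slot(String day, String time, String course, String instructor, String room) {
            this.day = day;
            this.time = time;
            this.course = course;
            this.instructor = instructor;
            this.room = room;
        }

        @Override
        public String toString() {
            return day + " " + time + ": " + course + ", " + instructor + ", " + room;
        }
    }

    /** Parses the timetable HTML and returns one Slot per non-empty cell. */
    static List<Slot> parse(String html) {
        Document doc = Jsoup.parse(html);
        List<Slot> slots = new ArrayList<>();

        // The table has no id, so select it through its container and class.
        Element table = doc.selectFirst("div.portlet-body table.table");
        if (table == null) {
            return slots;
        }

        // Header cells: index 0 is the blank corner, indexes 1..7 are MON..SUN.
        Elements headers = table.select("thead th");

        for (Element row : table.select("tbody tr")) {
            Elements cells = row.select("td");
            if (cells.isEmpty()) {
                continue;
            }
            // The first cell of each row holds the time range.
            String time = cells.get(0).text().trim();

            // Assumption: cell index == header index (no rowspan/colspan).
            for (int col = 1; col < cells.size() && col < headers.size(); col++) {
                Elements parts = cells.get(col).select("div");
                if (parts.size() < 3) {
                    continue; // empty (&nbsp;) cell, nothing scheduled
                }
                String day = headers.get(col).text().trim();
                String course = parts.get(0).text().trim();
                String instructor = parts.get(1).text().trim();
                // The room is written as "[F5]"; drop the square brackets.
                String room = parts.get(2).text().replace("[", "").replace("]", "").trim();
                slots.add(new Slot(day, time, course, instructor, room));
            }
        }
        return slots;
    }
}
```

To use it with a live page, fetch the document as in the answer and pass doc.html() to TimetableParser.parse. On Android the network call has to run off the main thread. If the portal ever merges cells with rowspan or colspan, the index-to-day mapping here would need to be reworked.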