Jsoup table parsing

Time: 2014-01-25 19:41:54

Tags: java parsing jsoup

I'm new to jsoup and to parsing in general, so if you need more information to answer my question, please let me know!

I have this table that I want to parse with Jsoup in Java. I only want to get the following text:

"B S Computer Science, CS (2012-2014)"

from this part of the table:

  <h3>Fahran S Kamili (fsk226)</h3>
        <div>
            10 Degree Audit Requests Returned.
        </div>
        <table>
            <thead>
                <tr>
<!-- *nrfkh - 9/2012: [degaudt-634]* -->
                        <th colspan="8">Degree Audits Requested</th>

<!-- *end nrfkh - 9/2012: [degaudt-634]* -->

                </tr>
                <tr>
                    <th>Rerun</th>

<!-- *nrfkh - 9/2012: [degaudt-634]* -->

<!-- *end nrfkh - 9/2012: [degaudt-634]* -->
                    <th>Request Created</th>
<!-- *nrfkh - 9/2012: [degaudt-634]* -->

<!-- *end nrfkh - 9/2012: [degaudt-634]* -->
                    <th>Audit Type</th>
                    <th>Program</th>
                    <th>Courses Requested</th>
                    <th>Request Status</th>
                    <th>Audit ID</th>
                    <th>Delete Option</th>
                </tr>
            </thead>
                    <tbody><tr>
                        <td>
                                    <a href="https://utdirect.utexas.edu/apps/degree/audits/requests/student_individual/?form-0-eid=fsk226&form-0-name=Fahran%20S%20Kamili&form-0-begin_ccyy=2012&form-0-degree_plan=ESC%20SS%20CS&form-0-minor=&current=X&future=&planned=&form-TOTAL_FORMS=20&form-INITIAL_FORMS=0&form-MAX_NUM_FORMS=&rerun=" target="_blank">Rerun</a>
                        </td>
<!-- *nrfkh - 9/2012: [degaudt-634]* -->
<!-- *end nrfkh - 9/2012: [degaudt-634]* -->
                        <td>
                            12/20/2013
                            05:06 PM
                        </td>
<!-- *nrfkh - 9/2012: [degaudt-634]* -->
<!-- *end nrfkh - 9/2012: [degaudt-634]* -->
                        <td>
                                Normal

                        </td>
                        <td>
                            B S Computer Science, CS
                            (2012-2014)
                        </td>

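The table actually goes on much longer, but the rest of its cells are just siblings of the ones shown here (so I assume that if I can get this text, I can easily get the others as well).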

1 answer:

Answer 0: (score: 0)

If I save that portion of the HTML to a file and parse it with jsoup, I would try printing every td element encountered, since that is what you are after:

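import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String... args) throws IOException {
    File input = new File("C:/users/XYZ/desktop/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "");
    // Print the text of every <td>; text() collapses runs of whitespace.
    Elements tds = doc.getElementsByTag("td");
    for (Element td : tds) {
        System.out.println(td.text());
    }
}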

Output:

Rerun
12/20/2013 05:06 PM
Normal
B S Computer Science, CS (2012-2014)
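
If only the "Program" text is wanted rather than every cell, one option is to select that cell by its position in the row. The following is a minimal sketch, not a definitive solution: it assumes the "Program" value is always the fourth td of the first body row (as in the excerpt above), that a jsoup version providing selectFirst is on the classpath, and it uses a hypothetical class name ProgramCell plus a trimmed-down inline copy of the HTML instead of the file.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ProgramCell {

    // Returns the text of the 4th cell of the first matching body row,
    // or null when no such cell exists. The column position is an
    // assumption taken from the excerpt in the question.
    static String extractProgram(String html) {
        Document doc = Jsoup.parse(html);
        // :eq(3) matches a td whose zero-based sibling index is 3;
        // HTML comments between cells are not elements and are not counted.
        Element cell = doc.selectFirst("table tbody tr td:eq(3)");
        return cell == null ? null : cell.text();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the table in the question.
        String html =
            "<table><tbody><tr>"
            + "<td><a href=\"#\">Rerun</a></td>"
            + "<!-- comment between cells -->"
            + "<td>12/20/2013\n 05:06 PM</td>"
            + "<td>Normal</td>"
            + "<td>B S Computer Science, CS\n (2012-2014)</td>"
            + "</tr></tbody></table>";
        // Prints: B S Computer Science, CS (2012-2014)
        System.out.println(extractProgram(html));
    }
}
```

Selecting by position is fragile if the site reorders its columns, so matching the header text "Program" first and then using its index may be a safer variation.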