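How do I select two (or more) HTML elements that sit at the same tree level with Jsoup?

Time: 2017-08-18 23:14:38

Tags: java html jsoup

I'm working on a project and have run into a problem. I need to scrape data from a website that contains the following HTML: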

<div class="lin-curso" style="border: 0;">
    <div class="lin-area-c3">
        Vagas 2017
    </div>
</div>
<div class="box10">
    <div class="lin-area-c1">
        L160
    </div>
    <div class="lin-area-c2">
        Acupuntura
    </div>
    <div class="lin-area-c3">
        [Lic-1º cic]
    </div>
</div>
<div class="lin-curso">
    <div class="lin-curso-c1">
        &nbsp;
    </div>
    <div class="lin-curso-c2">
        3155
    </div>
    <div class="lin-curso-c3">
        <a href="detcursopi.asp?codc=L160&amp;code=3155" title="3155/L160">Instituto Politécnico de Setúbal - Escola Superior de Saúde</a>
    </div>
    <div class="lin-curso-c4">
        20
    </div>
</div>
<br>
<div class="box10">
    <div class="lin-area-c1">
        9059
    </div>
    <div class="lin-area-c2">
        Administração e Gestão de Empresas
    </div>
    <div class="lin-area-c3">
        [Lic-1º cic]
    </div>
</div>
<div class="lin-curso">
    <div class="lin-curso-c1">
        &nbsp;
    </div>
    <div class="lin-curso-c2">
        2270
    </div>
    <div class="lin-curso-c3">
        <a href="detcursopi.asp?codc=9059&amp;code=2270" title="2270/9059">Universidade Católica Portuguesa - Faculdade de Ciências Económicas e Empresariais</a>
    </div>
    <div class="lin-curso-c4">
        n.d.
    </div>
</div>
<br>
<div class="box10">
    <div class="lin-area-c1">
        8056
    </div>
    <div class="lin-area-c2">
        Administração e Gestão Pública
    </div>
    <div class="lin-area-c3">
        [Lic-1º cic]
    </div>
</div>
<div class="lin-curso">
    <div class="lin-curso-c1">
        &nbsp;
    </div>
    <div class="lin-curso-c2">
        4275
    </div>
    <div class="lin-curso-c3">
        <a href="detcursopi.asp?codc=8056&amp;code=4275" title="4275/8056">Instituto Superior de Ciências da Administração</a>
    </div>
    <div class="lin-curso-c4">
        20
    </div>
</div>
<br>
<div class="box10">
    <div class="lin-area-c1">
        8194
    </div>
    <div class="lin-area-c2">
        Administração da Guarda Nacional Republicana
    </div>
    <div class="lin-area-c3">
        [Mest Integ]
    </div>
</div>
<div class="lin-curso">
    <div class="lin-curso-c1">
        &nbsp;
    </div>
    <div class="lin-curso-c2">
        7510
    </div>
    <div class="lin-curso-c3">
        <a href="detcursopi.asp?codc=8194&amp;code=7510" title="7510/8194">Academia Militar</a>
    </div>
    <div class="lin-curso-c4">
        n.d.
    </div>
</div>
<br>
<div class="box10">
    <div class="lin-area-c1">
        9672
    </div>
    <div class="lin-area-c2">
        Administração e Marketing
    </div>
    <div class="lin-area-c3">
        [Lic-1º cic]
    </div>
</div>

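The box10 and lin-curso divs should form a single element, but they don't. Some entries have exactly one lin-curso for a box10, while others have several lin-curso divs for a single box10. If each box10 and its lin-curso divs were one element, there would be no problem. Is there a way to link the two?

Edit: the site is at http://www.dges.gov.pt/guias/indcurso.asp?letra=A

The element is ".inside".

1 answer:

Answer 0 (score: 0)

This is quite easy to solve by working with sibling elements. In your case, the div with class box10 plays the role of a header in a table, and the sibling divs with class lin-curso play the role of the table's data rows. I suggest first selecting all divs with class box10: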

Elements boxes = doc.select("div.box10");
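To see what that selector returns without hitting the live site, here is a minimal sketch that parses an inline fragment. The fragment is a trimmed-down version of the markup in the question, and the class name SelectBoxesSketch and the helper selectBoxes are made up for illustration. It assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

final class SelectBoxesSketch {

    // Trimmed-down fragment modeled on the question's markup (assumption: the real page nests deeper).
    static final String HTML =
            "<div class=\"box10\"><div class=\"lin-area-c1\">L160</div></div>"
          + "<div class=\"lin-curso\"><div class=\"lin-curso-c2\">3155</div></div>"
          + "<br>"
          + "<div class=\"box10\"><div class=\"lin-area-c1\">9059</div></div>"
          + "<div class=\"lin-curso\"><div class=\"lin-curso-c2\">2270</div></div>";

    // Hypothetical helper: returns only the header divs, in document order.
    static Elements selectBoxes(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("div.box10");
    }

    public static void main(String[] args) {
        for (Element box : selectBoxes(HTML)) {
            // Prints the course code held in each header: L160, then 9059.
            System.out.println(box.select(".lin-area-c1").text());
        }
    }
}
```

The lin-curso rows are not part of the result, which is why a second step is needed to reach them from each header.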

You can then iterate over boxes and do two things:

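  1. Extract the data you are interested in from this div (it has 3 child divs, with classes lin-area-c1, lin-area-c2 and lin-area-c3).
  2. Select the sibling nodes with class lin-curso and extract data from them.

Jsoup provides a method called Element.nextElementSibling(), which returns the next sibling element of the element you call it on, or null if there is none. So when you call it on a div.box10 element, you get the div.lin-curso that follows it.

Here, "sibling" means the node that immediately follows the given node at the same tree level.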
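A small sketch of how Element.nextElementSibling() behaves on markup shaped like the question's. The fragment, the class name NextSiblingSketch and the helper describeNext are illustrative assumptions, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class NextSiblingSketch {

    // One header, one data row, and the <br> separator, all children of the same parent.
    static final String HTML =
            "<div id=\"root\">"
          + "<div class=\"box10\">header</div>"
          + "<div class=\"lin-curso\">row</div>"
          + "<br>"
          + "</div>";

    // Hypothetical helper: names the element that follows `el` on the same tree level.
    static String describeNext(Element el) {
        Element next = el.nextElementSibling();
        if (next == null) {
            return "none"; // `el` is the last element child of its parent
        }
        String cls = next.className();
        return cls.isEmpty() ? next.tagName() : next.tagName() + "." + cls;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Element box = doc.selectFirst("div.box10");
        Element row = box.nextElementSibling();
        Element br = row.nextElementSibling();

        System.out.println(describeNext(box)); // div.lin-curso
        System.out.println(describeNext(row)); // br
        System.out.println(describeNext(br));  // none
    }
}
```

Two details matter when walking rows this way: the walk has to stop at the first sibling that is not a lin-curso (here the br), and the null returned after the last element has to be checked before calling any method on it.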

Example solution

Below is sample code that parses the given website and prints the table to the console:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    import java.io.IOException;
    
    final class TestMain {
    
        public static void main(String[] args) throws IOException {
            Document doc = Jsoup.connect("http://www.dges.gov.pt/guias/indcurso.asp?letra=A").get();
    
            Elements boxes = doc.select("div.box10");
    
            for (Element box : boxes) {
                String linAreaC1 = box.select(".lin-area-c1").text();
                String linAreaC2 = box.select(".lin-area-c2").text();
                String linAreaC3 = box.select(".lin-area-c3").text();
    
                System.out.printf("%s: %s %s%n", linAreaC1, linAreaC2, linAreaC3);
    
                Element linCurso = box.nextElementSibling();
    
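                // nextElementSibling() returns null after the last element, so check for it first
                while (linCurso != null && linCurso.hasClass("lin-curso")) {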
                    String linCursoC2 = linCurso.select(".lin-curso-c2").text();
                    String linCursoC3 = linCurso.select(".lin-curso-c3").text();
                    String linCursoC4 = linCurso.select(".lin-curso-c4").text();
    
                    System.out.printf("%s\t%s\t%s%n", linCursoC2, linCursoC3, linCursoC4);
    
                    linCurso = linCurso.nextElementSibling();
                }
    
                System.out.println("==============================");
            }
        }
    }
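If the goal is to link each header to its rows rather than print them, the same sibling walk can fill a map. This is only a sketch under assumptions: the fragment is invented to cover both a header with several rows and a header with none, the row code 9999 is a placeholder that does not come from the site, and GroupRowsSketch and group are hypothetical names. jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupRowsSketch {

    // First header has two rows (9999 is a made-up placeholder), second header has none.
    static final String HTML =
            "<div class=\"box10\"><div class=\"lin-area-c1\">L160</div></div>"
          + "<div class=\"lin-curso\"><div class=\"lin-curso-c2\">3155</div></div>"
          + "<div class=\"lin-curso\"><div class=\"lin-curso-c2\">9999</div></div>"
          + "<br>"
          + "<div class=\"box10\"><div class=\"lin-area-c1\">9672</div></div>";

    // Maps each course code (lin-area-c1) to the codes of its rows (lin-curso-c2), in page order.
    static Map<String, List<String>> group(Document doc) {
        Map<String, List<String>> rowsByCourse = new LinkedHashMap<>();
        for (Element box : doc.select("div.box10")) {
            List<String> rows = new ArrayList<>();
            Element next = box.nextElementSibling();
            // Collect consecutive lin-curso siblings; stop at <br>, the next header, or the end.
            while (next != null && next.hasClass("lin-curso")) {
                rows.add(next.select(".lin-curso-c2").text());
                next = next.nextElementSibling();
            }
            rowsByCourse.put(box.select(".lin-area-c1").text(), rows);
        }
        return rowsByCourse;
    }

    public static void main(String[] args) {
        System.out.println(group(Jsoup.parse(HTML)));
        // prints {L160=[3155, 9999], 9672=[]}
    }
}
```

A LinkedHashMap keeps the headers in the order they appear on the page. If two headers could share the same code, a list of small header/rows objects would be a safer container than a map.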
    

I hope this helps.