I'm working on a project and have run into a problem. I need to scrape data from a website that contains the following HTML:
<div class="lin-curso" style="border: 0;">
<div class="lin-area-c3">
Vagas 2017
</div>
</div>
<div class="box10">
<div class="lin-area-c1">
L160
</div>
<div class="lin-area-c2">
Acupuntura
</div>
<div class="lin-area-c3">
[Lic-1º cic]
</div>
</div>
<div class="lin-curso">
<div class="lin-curso-c1">
</div>
<div class="lin-curso-c2">
3155
</div>
<div class="lin-curso-c3">
<a href="detcursopi.asp?codc=L160&code=3155" title="3155/L160">Instituto Politécnico de Setúbal - Escola Superior de Saúde</a>
</div>
<div class="lin-curso-c4">
20
</div>
</div>
<br>
<div class="box10">
<div class="lin-area-c1">
9059
</div>
<div class="lin-area-c2">
Administração e Gestão de Empresas
</div>
<div class="lin-area-c3">
[Lic-1º cic]
</div>
</div>
<div class="lin-curso">
<div class="lin-curso-c1">
</div>
<div class="lin-curso-c2">
2270
</div>
<div class="lin-curso-c3">
<a href="detcursopi.asp?codc=9059&code=2270" title="2270/9059">Universidade Católica Portuguesa - Faculdade de Ciências Económicas e Empresariais</a>
</div>
<div class="lin-curso-c4">
n.d.
</div>
</div>
<br>
<div class="box10">
<div class="lin-area-c1">
8056
</div>
<div class="lin-area-c2">
Administração e Gestão Pública
</div>
<div class="lin-area-c3">
[Lic-1º cic]
</div>
</div>
<div class="lin-curso">
<div class="lin-curso-c1">
</div>
<div class="lin-curso-c2">
4275
</div>
<div class="lin-curso-c3">
<a href="detcursopi.asp?codc=8056&code=4275" title="4275/8056">Instituto Superior de Ciências da Administração</a>
</div>
<div class="lin-curso-c4">
20
</div>
</div>
<br>
<div class="box10">
<div class="lin-area-c1">
8194
</div>
<div class="lin-area-c2">
Administração da Guarda Nacional Republicana
</div>
<div class="lin-area-c3">
[Mest Integ]
</div>
</div>
<div class="lin-curso">
<div class="lin-curso-c1">
</div>
<div class="lin-curso-c2">
7510
</div>
<div class="lin-curso-c3">
<a href="detcursopi.asp?codc=8194&code=7510" title="7510/8194">Academia Militar</a>
</div>
<div class="lin-curso-c4">
n.d.
</div>
</div>
<br>
<div class="box10">
<div class="lin-area-c1">
9672
</div>
<div class="lin-area-c2">
Administração e Marketing
</div>
<div class="lin-area-c3">
[Lic-1º cic]
</div>
</div>
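Each box10 and the lin-curso that follows it should form a single element, but they don't. In some places there is exactly one lin-curso for a box10, while in others a single box10 has several lin-curso rows. If box10 and its lin-curso rows were one element there would be no problem. Is there a way to link the two together?
Edit: the site link is: http://www.dges.gov.pt/guias/indcurso.asp?letra=A
The element is ".inside"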
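Answer 0 (score: 0)
This is quite easy to solve once you work with sibling elements. In your case, the div with class box10 plays the role of a table header, and the sibling divs with class lin-curso play the role of the table's data rows. I suggest first selecting all divs with class box10: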
Elements boxes = doc.select("div.box10");
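Then you can iterate over boxes and do two main things:
1. extract the header data (the divs with classes lin-area-c1, lin-area-c2 and lin-area-c3)
2. get the sibling nodes with class lin-curso and extract the data from them.
Jsoup provides a method called Element.nextElementSibling(), which returns the next sibling element of the element you call it on. So when you call it on a div.box10 element, you get the div.lin-curso that follows it, when there is one.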
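Here, sibling means the element that immediately follows the given node at the same level of the tree (text nodes in between are skipped).
Below is sample code that parses the given website and prints the table to the console: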
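To make the sibling behaviour concrete, here is a minimal sketch that parses a trimmed copy of the question's markup from a string instead of fetching the page. The class name SiblingDemo, its helper methods and the HTML constant are made up for illustration, and the sketch assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class SiblingDemo {
    // Trimmed, hypothetical copy of the question's markup: one header, one row, one <br>.
    static final String HTML =
        "<div class=\"inside\">"
        + "<div class=\"box10\"><div class=\"lin-area-c1\">L160</div></div>"
        + "<div class=\"lin-curso\"><div class=\"lin-curso-c2\">3155</div></div>"
        + "<br>"
        + "</div>";

    // Class attribute of the element directly after the first div.box10, or null if none.
    static String classAfterFirstBox(Document doc) {
        Element box = doc.select("div.box10").first();
        Element next = (box == null) ? null : box.nextElementSibling();
        return (next == null) ? null : next.className();
    }

    // True when the <br> is the last element inside its parent.
    static boolean brIsLast(Document doc) {
        Element br = doc.select("br").first();
        return br != null && br.nextElementSibling() == null;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Element box = doc.select("div.box10").first();
        Element row = box.nextElementSibling();   // the div.lin-curso row
        Element br = row.nextElementSibling();    // the <br>, which has no lin-curso class
        System.out.println(row.className());
        System.out.println(br.tagName());
        System.out.println(br.nextElementSibling() == null);
    }
}
```

The last call is the case to watch for: nextElementSibling() returns null when there is no further element, so any loop over siblings should check for null before calling methods on the result.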
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

final class TestMain {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.dges.gov.pt/guias/indcurso.asp?letra=A").get();
        // Every div.box10 is a header; its data rows are the div.lin-curso siblings that follow it.
        Elements boxes = doc.select("div.box10");
        for (Element box : boxes) {
            // Header cells: course code, course name and degree type.
            String linAreaC1 = box.select(".lin-area-c1").text();
            String linAreaC2 = box.select(".lin-area-c2").text();
            String linAreaC3 = box.select(".lin-area-c3").text();
            System.out.printf("%s: %s %s%n", linAreaC1, linAreaC2, linAreaC3);
            // Walk the following siblings until something other than div.lin-curso (e.g. <br>) appears.
            // The null check covers a box10 that is the last element of its parent.
            Element linCurso = box.nextElementSibling();
            while (linCurso != null && linCurso.hasClass("lin-curso")) {
                String linCursoC2 = linCurso.select(".lin-curso-c2").text();
                String linCursoC3 = linCurso.select(".lin-curso-c3").text();
                String linCursoC4 = linCurso.select(".lin-curso-c4").text();
                System.out.printf("%s\t%s\t%s%n", linCursoC2, linCursoC3, linCursoC4);
                linCurso = linCurso.nextElementSibling();
            }
            System.out.println("==============================");
        }
    }
}
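If you would rather keep each header linked to its rows in memory than print them, the same loop can fill a map. The following is only a sketch under some assumptions: the class name CourseGrouper, the group method and the SAMPLE string are hypothetical, the course code in lin-area-c1 is assumed to be a usable key, and the HTML is parsed from a string so the example runs without network access.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CourseGrouper {
    // Hypothetical sample: one header with a single row, then a header with no rows.
    static final String SAMPLE =
        "<div class=\"inside\">"
        + "<div class=\"box10\"><div class=\"lin-area-c1\">L160</div>"
        + "<div class=\"lin-area-c2\">Acupuntura</div></div>"
        + "<div class=\"lin-curso\"><div class=\"lin-curso-c2\">3155</div>"
        + "<div class=\"lin-curso-c3\"><a href=\"#\">Escola Superior de Saude</a></div>"
        + "<div class=\"lin-curso-c4\">20</div></div>"
        + "<br>"
        + "<div class=\"box10\"><div class=\"lin-area-c1\">9672</div>"
        + "<div class=\"lin-area-c2\">Administracao e Marketing</div></div>"
        + "</div>";

    // Maps each course code (lin-area-c1) to the rows that follow its div.box10.
    static Map<String, List<String>> group(Document doc) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Element box : doc.select("div.box10")) {
            String courseCode = box.select(".lin-area-c1").text();
            // computeIfAbsent keeps earlier rows if the same code shows up twice.
            List<String> rows = result.computeIfAbsent(courseCode, k -> new ArrayList<>());
            Element sibling = box.nextElementSibling();
            while (sibling != null && sibling.hasClass("lin-curso")) {
                rows.add(sibling.select(".lin-curso-c2").text()
                    + " | " + sibling.select(".lin-curso-c3").text()
                    + " | " + sibling.select(".lin-curso-c4").text());
                sibling = sibling.nextElementSibling();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> grouped = group(Jsoup.parse(SAMPLE));
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

A LinkedHashMap keeps the headers in page order, and a header with no rows simply maps to an empty list, so it does not need special handling.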
I hope this helps.