Question

I need to crawl a set of web pages that contain only XML data, and I want to read an attribute of one particular element. How can I do this in Java?

Say the XML structure is:

<page>
       <student id="2406">
        .
        .
       </student>

       .
       . 
       . 
</page>

I need to crawl a large number of pages, so please suggest a fast crawling tool.

Edit: I have looked at some pages on this topic, but I could not find a reasonable answer. Any code is also welcome.

Answer 1

Jsoup would be a good tool for this. Here is what you can do with it:

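import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
String xml = "this would be your xml";
// Use the XML parser so the input is not normalized as an HTML page
Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
for (Element e : doc.select("tag")) {
    System.out.println(e); // prints every element whose tag name is "tag"
}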
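To get at the attribute the question asks about, a minimal sketch could look like the following. It assumes the `<page>`/`<student id="...">` layout from the question; the class name `StudentIdExtractor` and the method `extractStudentIds` are made up for illustration and are not part of Jsoup.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

// Hypothetical helper class; the name is an assumption, not part of Jsoup.
class StudentIdExtractor {

    // Collects the "id" attribute of every <student> element in the given XML.
    static List<String> extractStudentIds(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        List<String> ids = new ArrayList<>();
        for (Element student : doc.select("student")) {
            // attr() returns an empty string when the attribute is missing,
            // so elements without an id are skipped explicitly.
            if (student.hasAttr("id")) {
                ids.add(student.attr("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<page><student id=\"2406\"></student></page>";
        System.out.println(extractStudentIds(xml));
    }
}
```

The selector `student` matches by tag name; something like `student[id]` should restrict the match to elements that carry the attribute, if you prefer to filter in the selector instead.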

To fetch a web page, use the following code:

Document doc = Jsoup.connect("url").get();
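Note that `get()` parses the response as HTML unless told otherwise. Since your pages contain only XML, a sketch along these lines may fit better. The class name `StudentPageFetcher`, its method names, and the 10-second timeout are assumptions chosen for illustration; the URLs are expected as command-line arguments.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

// Hypothetical helper class; the names here are assumptions, not Jsoup API.
class StudentPageFetcher {

    // Downloads the page and parses the body with Jsoup's XML parser.
    static Document fetchXml(String url) throws IOException {
        return Jsoup.connect(url)
                .parser(Parser.xmlParser())
                .timeout(10_000) // milliseconds; an arbitrary value for this sketch
                .get();
    }

    // Returns the id of the first <student> element, or null if there is none.
    static String firstStudentId(Document doc) {
        Element student = doc.selectFirst("student");
        return student == null ? null : student.attr("id");
    }

    public static void main(String[] args) throws IOException {
        // Each command-line argument is treated as one URL to fetch.
        for (String url : args) {
            System.out.println(url + " -> " + firstStudentId(fetchXml(url)));
        }
    }
}
```

This loop fetches the pages one at a time. For a large number of pages, most of the time will probably go to network waits rather than parsing, so running the fetches from a thread pool is a natural next step.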
