Question

I am trying to extract everything from <html> to </html> in a web page. The code below works for local .html files, but not for live websites.

Document doc = Jsoup.parse("http://www.imdb.com", "UTF-8");
System.out.println(doc.text());

Thanks in advance.

Answer 1

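For a remote website, use Document doc = Jsoup.connect("http://www.imdb.com").get(); instead. Jsoup.parse(String, String) treats its first argument as HTML markup and its second as the base URI, so it parses the URL string itself and never fetches the page.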
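To illustrate the difference, here is a minimal sketch, assuming Jsoup is on the classpath. The class name FullHtmlDemo and its helper methods are hypothetical and not part of the original answer. It shows that parsing a URL string yields only that string as text, and that outerHtml() (rather than text()) returns the markup from <html> to </html>.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; the names are illustrative only.
public class FullHtmlDemo {

    // Parses the given string as HTML markup (no network access)
    // and returns only its visible text.
    static String textOf(String html) {
        return Jsoup.parse(html).text();
    }

    // Parses the given string as HTML markup and returns the whole
    // document as markup, from <html> to </html>.
    static String markupOf(String html) {
        return Jsoup.parse(html).outerHtml();
    }

    // Fetches a remote page over HTTP and returns its full markup.
    static String fetchMarkup(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();
        return doc.outerHtml();
    }

    public static void main(String[] args) throws IOException {
        // The URL is treated as page content here, so the URL itself is printed.
        System.out.println(textOf("http://www.imdb.com"));

        // Requires network access; the site may redirect or reject the request.
        System.out.println(fetchMarkup("http://www.imdb.com"));
    }
}
```

Note that text() strips all tags, so even with connect(...).get() the original println(doc.text()) would print only the page's visible text, not its markup.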

Answer 2

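import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Wrapper class added so the snippet compiles on its own.
public class LinkExtractor {
    public static void main(String[] args) {
        try {
            // The URL must include the protocol (http:// or https://).
            Document doc = Jsoup.connect("http://www.imdb.com").get();

            // Print the page title.
            String title = doc.title();
            System.out.println("title : " + title);

            // Select every anchor element that has an href attribute.
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // Print the href value and the link text.
                System.out.println("\nlink : " + link.attr("href"));
                System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            // Thrown on connection failures, timeouts, or HTTP error statuses.
            e.printStackTrace();
        }
    }
}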

Source: mkyong.com

Extract all data from a web page

2 answers: