Question

I am trying to use jsoup to extract and print all the links on a web page:

Document doc = Jsoup.connect("https://www.youtube.com/").get();
Elements links = doc.select("link");
Elements scripts = doc.select("script");
for (Element element : links) {
    System.out.println("href:" + element.absUrl("href"));
}
for (Element element : scripts) {
    System.out.println("src:" + element.absUrl("src"));
}

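This is my code. It runs without errors, but it returns only a few of the links rather than all of them, and many of the src values are printed blank. This is the output: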

href:https://s.ytimg.com/yts/cssbin/www-core-vfluKFg1a.css
href:https://s.ytimg.com/yts/cssbin/www-home-c4-vfl4p1Pju.css
href:https://s.ytimg.com/yts/cssbin/www-pageframe-vflfdzMKI.css
href:https://s.ytimg.com/yts/cssbin/www-guide-vflTkT47C.css
href:http://www.youtube.com/opensearch?locale=en_US
href:https://s.ytimg.com/yts/img/favicon-vfldLzJxy.ico
href:https://s.ytimg.com/yts/img/favicon_32-vflWoMFGx.png
href:http://www.youtube.com/
href:https://m.youtube.com/?
href:https://m.youtube.com/?
href:https://plus.google.com/115229808208707341778
src:
src:
src:https://s.ytimg.com/yts/jsbin/www-scheduler-vflNAje0j/www-scheduler.js
src:
src:
src:https://s.ytimg.com/yts/jsbin/spf-vfld6zcp2/spf.js
src:https://s.ytimg.com/yts/jsbin/www-en_US-vflLgbz4u/base.js
src:
src:

Why does this happen, and how can I fix it?

Answer 1

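You are selecting all <link> elements, but hyperlinks are <a> elements. To get every hyperlink, select the <a> elements instead.

The blank src output appears because some <script> elements do not reference an external file through a src attribute; they contain inline JavaScript instead, so absUrl("src") returns an empty string for them.

You can use more specific selectors to match only the elements that actually carry the attribute, as in the following code.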

// Get the document
Document doc = Jsoup.connect("https://www.youtube.com/").get();

// Get all the hyperlinks
Elements links = doc.select("a[href]");
// Loop through them
for (Element element : links) {
    System.out.println("href: " + element.absUrl("href"));
}

// Get all script elements that have a src attribute
Elements scriptSources = doc.select("script[src]");
// Loop through them
for (Element element : scriptSources) {
    System.out.println("src: " + element.absUrl("src"));
}
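
To see the difference between the selectors without depending on a live site, here is a minimal offline sketch. It assumes jsoup is on the classpath; the sample HTML, the example.com base URI, and the class and helper names (SelectorDemo, absUrls) are made up for illustration and are not part of the original question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorDemo {

    // Hypothetical sample page: one stylesheet <link>, one external script,
    // one inline script, and two hyperlinks (one relative, one absolute).
    static final String HTML =
            "<html><head>"
            + "<link rel=\"stylesheet\" href=\"/style.css\">"
            + "<script src=\"/app.js\"></script>"
            + "<script>var inline = true;</script>"
            + "</head><body>"
            + "<a href=\"/about\">About</a>"
            + "<a href=\"https://example.org/\">External</a>"
            + "</body></html>";

    // Collects absUrl(attr) for every element matched by the selector.
    static List<String> absUrls(Document doc, String selector, String attr) {
        List<String> urls = new ArrayList<>();
        for (Element element : doc.select(selector)) {
            urls.add(element.absUrl(attr));
        }
        return urls;
    }

    public static void main(String[] args) {
        // The base URI (assumed here) is what lets absUrl resolve relative paths.
        Document doc = Jsoup.parse(HTML, "https://example.com/");

        // "link" matches only the stylesheet, not the hyperlinks.
        System.out.println("link:        " + absUrls(doc, "link", "href"));
        // "a[href]" matches the actual hyperlinks.
        System.out.println("a[href]:     " + absUrls(doc, "a[href]", "href"));
        // "script" also matches the inline script, which has no src.
        System.out.println("script:      " + absUrls(doc, "script", "src"));
        // "script[src]" skips inline scripts.
        System.out.println("script[src]: " + absUrls(doc, "script[src]", "src"));
    }
}
```

With this sample, the plain script selector yields one empty entry for the inline script, which matches the blank src: lines in the question, while script[src] leaves it out.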
