I'm trying to crawl a specific website with Jsoup and build a list of all the links reachable from it.
This is what I have so far:
public void doCrawl(String url) {
    Set<String> thisPageLinks = getLinks(url);
    pageVisited.add(url);
    for (String link : thisPageLinks) {
        if (link.contains(this.domain) && !pageVisited.contains(link) && !link.contains("#")) {
            logger.info("Subcrawling {}", link);
            doCrawl(link);
        }
    }
    this.pageLinks.addAll(thisPageLinks);
}
// Gets the links on a given URL and returns them as a set of strings
public Set<String> getLinks(String url) {
    Set<String> links = new HashSet<String>();
    try {
        Connection.Response res = Jsoup.connect(url).userAgent(this.USER_AGENT).execute();
        Document doc = res.parse();
        Elements elements = doc.select("a");
        for (Element element : elements) {
            links.add(element.attr("href"));
        }
        return links;
    } catch (IOException io) {
        logger.error(io);
        System.err.println(io.getMessage());
        io.printStackTrace();
        return links;
    }
}
With this I get all the links on a page, and that part works. But when doCrawl() tries to visit those links, I get this error:
https://stackoverflow.com
Terminando programa.
Exception in thread "main" java.lang.IllegalArgumentException: Malformed URL: //pt.stackoverflow.com
at org.jsoup.helper.HttpConnection.url(HttpConnection.java:78)
at org.jsoup.helper.HttpConnection.connect(HttpConnection.java:38)
at org.jsoup.Jsoup.connect(Jsoup.java:73)
at web.Crawler.getLinks(Crawler.java:78)
at web.Crawler.doCrawl(Crawler.java:53)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.crawl(Crawler.java:48)
at web.MainCrawler.main(MainCrawler.java:29)
Caused by: java.net.MalformedURLException: no protocol: //pt.stackoverflow.com
at java.net.URL.<init>(URL.java:593)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)
at org.jsoup.helper.HttpConnection.url(HttpConnection.java:76)
... 7 more
Any idea what I'm doing wrong?
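Update: I changed to absolute href's. Concretely, the extraction loop now uses Jsoup's abs: attribute prefix, which resolves each link against the page's base URI. A rough sketch of the change (only the loop inside getLinks() differs):

// Sketch: attr("abs:href") resolves the link against the document's base URI,
// so a protocol-relative href like "//pt.stackoverflow.com" becomes
// "https://pt.stackoverflow.com" and Jsoup.connect() no longer rejects it.
for (Element element : elements) {
    String absolute = element.attr("abs:href");
    if (!absolute.isEmpty()) { // empty when the href cannot be resolved
        links.add(absolute);
    }
}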
That seems to work, but this time I get the error below. Could it be because the file I'm connecting to doesn't have HTML content?
Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml
org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml. Mimetype=application/zip, URL=https://codeload.github.com/gist/c1f5e6ae5208f0b2f83713eaa175d409/zip/master
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:547)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:534)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
at web.Crawler.getLinks(Crawler.java:78)
at web.Crawler.doCrawl(Crawler.java:53)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.crawl(Crawler.java:48)
at web.MainCrawler.main(MainCrawler.java:29)
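For what it's worth, the workaround I'm considering is to skip non-HTML responses before parsing. A sketch, assuming it's fine to simply ignore such URLs (ignoreContentType() and contentType() are Jsoup's API; the content-type filtering is my own guess):

// Sketch: fetch with ignoreContentType(true) so execute() doesn't throw on
// non-HTML responses (e.g. application/zip), then only parse the body when
// the Content-Type is one Jsoup can handle.
Connection.Response res = Jsoup.connect(url)
        .userAgent(this.USER_AGENT)
        .ignoreContentType(true)
        .execute();
String contentType = res.contentType();
if (contentType != null
        && (contentType.startsWith("text/")
            || contentType.startsWith("application/xml")
            || contentType.startsWith("application/xhtml+xml"))) {
    Document doc = res.parse();
    // ...extract the links as before
}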