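I'm working on a school project in which I need to retrieve all the links from a given URL.
I'm supposed to get a list of all the "a" elements (links) and write it to a document. Right now I'm stuck on getting the links.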
public Set<String> getLinks(String url) {
    Set<String> links = new HashSet<String>();
    try {
        Connection.Response res = Jsoup.connect(url).userAgent(this.USER_AGENT).execute();
        Document doc = res.parse();
        Elements elements = doc.select("a");
        for (Element element : elements) {
            links.add(element.attr("href"));
        }
        return links;
    } catch (IOException io) {
        logger.error(io);
        System.err.println(io.getMessage());
        io.printStackTrace();
        return links;
    }
}
This is the user agent string I'm using:
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) "
+ "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36";
Am I doing something wrong?
Update:
I tried running a test like this in main:
    public static void main(String[] args) {
        Set<String> links = new HashSet<String>();
        try {
            // 1. Gets the links from a specific URL and returns the
            // set of strings
            // write your code here {
            Connection.Response res = Jsoup.connect(" https://gist.github.com/mark-cooper/1491327").userAgent("Mozilla").execute();
            Document doc = res.parse();
            Elements elements = doc.select("a[href]");
            for (Element element : elements) {
                links.add(element.toString());
            }
            links.forEach(k -> System.out.println(k));
            // }
        } catch (IOException io) {
            logger.error(io);
            System.err.println(io.getMessage());
            io.printStackTrace();
        }
    }
}
I got this error:
Exception in thread "main" java.lang.IllegalArgumentException: Malformed URL: https://gist.github.com/mark-cooper/1491327
at org.jsoup.helper.HttpConnection.url(HttpConnection.java:78)
at org.jsoup.helper.HttpConnection.connect(HttpConnection.java:38)
at org.jsoup.Jsoup.connect(Jsoup.java:73)
at web.MainCrawler.main(MainCrawler.java:33)
Caused by: java.net.MalformedURLException: no protocol: %20https://gist.github.com/mark-cooper/1491327
at java.net.URL.<init>(URL.java:593)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)
at org.jsoup.helper.HttpConnection.url(HttpConnection.java:76)
... 3 more
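One thing that stands out in the trace: the `Caused by` line reports the URL as `%20https://...`, and `%20` is an encoded space. The string literal passed to `Jsoup.connect` in the test above does begin with a space, so the text before the `:` is no longer a valid protocol name. Below is a minimal sketch of cleaning the URL string before connecting. It uses only `java.net.URI` from the standard library, and `UrlCleaner` and `cleanUrl` are hypothetical names made up for this illustration, not part of jsoup.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not part of jsoup.
final class UrlCleaner {

    // Strips surrounding whitespace and checks that the result parses as a
    // URI with a scheme (for example "https") before it is used anywhere else.
    static String cleanUrl(String rawUrl) {
        String trimmed = rawUrl.trim();
        try {
            URI uri = new URI(trimmed);
            if (uri.getScheme() == null) {
                throw new IllegalArgumentException("No protocol in URL: " + trimmed);
            }
            return trimmed;
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Malformed URL: " + trimmed, e);
        }
    }

    public static void main(String[] args) {
        // Same URL as in the test above, including the leading space.
        String cleaned = cleanUrl(" https://gist.github.com/mark-cooper/1491327");
        System.out.println(cleaned);
    }
}
```

Assuming this is indeed the cause, the cleaned string could then be passed to `Jsoup.connect(...)` in place of the raw literal. Simply deleting the space from the literal should have the same effect.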