Problem retrieving links from a document using jsoup

Time: 2017-09-24 13:19:45

Tags: java html eclipse jsoup

I'm working on a school project in which I need to retrieve all the links from a given URL.

I'm supposed to get a list of all the "a" (link) elements and write it to a document. Right now I'm stuck on getting the links.
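
For the second half of the task, writing the collected links to a document, a minimal sketch using only the standard library might look like the following. The class name `LinkWriter`, the method `writeLinks`, and the one-link-per-line output format are assumptions made for illustration; they are not part of the assignment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

class LinkWriter {
    // Hypothetical helper: writes each link on its own line as UTF-8.
    // The links are copied into a TreeSet so the output order is stable.
    public static void writeLinks(Set<String> links, Path target) throws IOException {
        Files.write(target, new TreeSet<>(links), StandardCharsets.UTF_8);
    }
}
```

A call such as `LinkWriter.writeLinks(links, Paths.get("links.txt"))` would then create or overwrite `links.txt`; the file name here is only an example.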

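    // Requires: java.io.IOException, java.util.HashSet, java.util.Set,
    // org.jsoup.Connection, org.jsoup.Jsoup, org.jsoup.nodes.Document,
    // org.jsoup.nodes.Element, org.jsoup.select.Elements
    public Set<String> getLinks(String url) {
        Set<String> links = new HashSet<String>();

        try {
            // Fetch the page, sending a browser-like user agent.
            Connection.Response res = Jsoup.connect(url).userAgent(this.USER_AGENT).execute();
            Document doc = res.parse();

            // Select only the anchors that actually carry an href attribute.
            Elements elements = doc.select("a[href]");

            for (Element element : elements) {
                // attr("href") returns the raw attribute value, which may be a relative URL.
                links.add(element.attr("href"));
            }
        } catch (IOException io) {
            logger.error(io);
            System.err.println(io.getMessage());
            io.printStackTrace();
        }

        // On failure this is whatever was collected before the exception (possibly empty).
        return links;
    }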

This is the user agent string I'm using:

"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) "
        + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36";

Am I doing something wrong?
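
One thing to keep in mind with `element.attr("href")` is that it returns the attribute value exactly as written in the page, so relative links such as `/about` stay relative. jsoup can resolve these itself through `element.absUrl("href")` (or `attr("abs:href")`); the sketch below shows the same idea with only `java.net.URI`, so it runs without jsoup. `LinkResolver` and `toAbsolute` are hypothetical names, and the `example.com` URLs mentioned afterwards are made up.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class LinkResolver {
    // Resolves each raw href against the page URL.
    // Blank values and values that are not valid URIs are skipped.
    public static Set<String> toAbsolute(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl.trim());
        Set<String> result = new LinkedHashSet<>();
        for (String href : hrefs) {
            String raw = href.trim();
            if (raw.isEmpty()) {
                continue;
            }
            try {
                result.add(base.resolve(raw).toString());
            } catch (IllegalArgumentException e) {
                // URI.resolve(String) throws this when the href cannot be parsed as a URI.
            }
        }
        return result;
    }
}
```

With the page URL `https://example.com/docs/index.html`, the href `/about` resolves to `https://example.com/about` and `guide.html` resolves to `https://example.com/docs/guide.html`, while an href that is already absolute is returned unchanged.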

Update:

I tried running a test like this in main:

    public static void main(String[] args) {

        Set<String> links = new HashSet<String>();

        try {
            // 1. Gets the links from a specific URL and returns the
            // set of strings
            // write your code here {

            // Note: this URL literal begins with a space character.
            Connection.Response res = Jsoup.connect(" https://gist.github.com/mark-cooper/1491327").userAgent("Mozilla").execute();

            Document doc = res.parse();

            Elements elements = doc.select("a[href]");

            for (Element element : elements) {
                // toString() yields the element's outer HTML, not just the href value.
                links.add(element.toString());
            }

            links.forEach(k -> System.out.println(k));

            // }

        } catch (IOException io) {
            logger.error(io);
            System.err.println(io.getMessage());
            io.printStackTrace();
        }

    }

    }

I got this error:

    Exception in thread "main" java.lang.IllegalArgumentException: Malformed URL:  https://gist.github.com/mark-cooper/1491327
        at org.jsoup.helper.HttpConnection.url(HttpConnection.java:78)
        at org.jsoup.helper.HttpConnection.connect(HttpConnection.java:38)
        at org.jsoup.Jsoup.connect(Jsoup.java:73)
        at web.MainCrawler.main(MainCrawler.java:33)
    Caused by: java.net.MalformedURLException: no protocol:     %20https://gist.github.com/mark-cooper/1491327
        at java.net.URL.<init>(URL.java:593)
        at java.net.URL.<init>(URL.java:490)
        at java.net.URL.<init>(URL.java:439)
        at org.jsoup.helper.HttpConnection.url(HttpConnection.java:76)
        ... 3 more
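
The stack trace suggests where the problem lies: the string passed to `Jsoup.connect` starts with a space, which shows up as `%20` in front of `https` in the `no protocol` message, so `java.net.URL` can no longer find a scheme. A minimal sketch of guarding against this, assuming it is acceptable to trim surrounding whitespace before connecting (`UrlCleaner` and `clean` are hypothetical names, not part of jsoup):

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlCleaner {
    // Trims surrounding whitespace and checks that the result parses as a URL.
    static String clean(String rawUrl) throws MalformedURLException {
        String trimmed = rawUrl.trim();
        // The constructor throws MalformedURLException when no valid protocol is found.
        new URL(trimmed);
        return trimmed;
    }
}
```

The cleaned string could then be handed on, for example as `Jsoup.connect(UrlCleaner.clean(url))`. Simply removing the leading space from the string literal should have the same effect for the test above.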

0 answers:

No answers yet