Jsoup "Malformed URL" when trying to visit a website

Date: 2017-09-24 16:23:57

Tags: java eclipse jsoup

I'm trying to build a list of links: using Jsoup, I want to collect the links reachable from a given website.

This is what I have so far:

    // Recursively crawls every in-domain link reachable from the given URL
    public void doCrawl(String url){
        Set<String> thisPageLinks = getLinks(url);
        pageVisited.add(url);

        for (String link : thisPageLinks){
            // Only follow in-domain links that haven't been visited yet,
            // skipping fragment (#) links
            if (link.contains(this.domain) && !pageVisited.contains(link) && !link.contains("#")){
                logger.info("Subcrawling {}", link);
                doCrawl(link);
            }
        }
        this.pageLinks.addAll(thisPageLinks);
    }

    // Gets the links found at a given URL
    public Set<String> getLinks(String url){
        Set<String> links = new HashSet<String>();

        try {
            // 1. Fetch the links from the given URL and return them
            // as a set of strings
            // write your code here {
            Connection.Response res = Jsoup.connect(url).userAgent(this.USER_AGENT).execute();
            Document doc = res.parse();

            // Collect the href attribute of every anchor tag
            Elements elements = doc.select("a");
            for (Element element : elements){
                links.add(element.attr("href"));
            }
            // }
            return links;

        } catch (IOException io){
            logger.error(io);
            System.err.println(io.getMessage());
            io.printStackTrace();
            return links;
        }
    }
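
For context, the crawl is started from a small driver; this is a hypothetical reconstruction from the MainCrawler.main and Crawler.crawl frames in the stack traces below, so the class name, constructor arguments, and crawl() signature are assumptions rather than the exact code:

    // Hypothetical driver, reconstructed from the stack traces below;
    // constructor arguments and the crawl() signature are assumptions
    public class MainCrawler {
        public static void main(String[] args) {
            Crawler crawler = new Crawler("stackoverflow.com", "https://stackoverflow.com");
            crawler.crawl();  // crawl() delegates to doCrawl(startUrl)
        }
    }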

This does get me all the links from a page, so that part works, but when I try to visit those links from doCrawl() I get this error:

https://stackoverflow.com
Terminando programa.Exception in thread "main" 
java.lang.IllegalArgumentException: Malformed URL: //pt.stackoverflow.com
at org.jsoup.helper.HttpConnection.url(HttpConnection.java:78)
at org.jsoup.helper.HttpConnection.connect(HttpConnection.java:38)
at org.jsoup.Jsoup.connect(Jsoup.java:73)
at web.Crawler.getLinks(Crawler.java:78)
at web.Crawler.doCrawl(Crawler.java:53)
at web.Crawler.doCrawl(Crawler.java:60)
at web.Crawler.crawl(Crawler.java:48)
at web.MainCrawler.main(MainCrawler.java:29)
Caused by: java.net.MalformedURLException: no protocol: //pt.stackoverflow.com
at java.net.URL.<init>(URL.java:593)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)
at org.jsoup.helper.HttpConnection.url(HttpConnection.java:76)
... 7 more

Any idea what I'm doing wrong?
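
As far as I can tell, the value Jsoup rejects is a protocol-relative URL: attr("href") returns the attribute exactly as written in the page, and java.net.URL (which the trace shows Jsoup using internally) refuses it because it has no scheme. A minimal standalone demo of the difference (hypothetical class, purely for illustration):

    import java.net.MalformedURLException;
    import java.net.URL;

    public class ProtocolRelativeDemo {
        public static void main(String[] args) throws MalformedURLException {
            String href = "//pt.stackoverflow.com";      // what attr("href") returned
            URL base = new URL("https://stackoverflow.com");
            // Resolving against the page's base URL supplies the scheme
            System.out.println(new URL(base, href));     // https://pt.stackoverflow.com
            // Parsing it on its own fails with "no protocol"
            new URL(href);                               // throws MalformedURLException
        }
    }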

Update: I switched to absolute hrefs and the crawl seems to work now, but this time I get the error below. Could it be because the file I'm connecting to doesn't contain HTML?
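
The change amounts to asking Jsoup for the absolute form of each href instead of the raw attribute value; a minimal sketch of the edited loop (not the exact code), using Element.absUrl, which resolves each href against the response URL:

    for (Element element : elements){
        // absUrl resolves relative and protocol-relative hrefs against
        // the page's base URI, and returns "" when it can't
        String absolute = element.absUrl("href");
        if (!absolute.isEmpty()){
            links.add(absolute);
        }
    }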

Unhandled content type. Must be text/*, application/xml, or  application/xhtml+xml
org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml. Mimetype=application/zip, URL=https://codeload.github.com/gist/c1f5e6ae5208f0b2f83713eaa175d409/zip/master
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:547)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:534)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
    at web.Crawler.getLinks(Crawler.java:78)
    at web.Crawler.doCrawl(Crawler.java:53)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.doCrawl(Crawler.java:60)
    at web.Crawler.crawl(Crawler.java:48)
    at web.MainCrawler.main(MainCrawler.java:29)
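
If the cause really is non-HTML content, one option would be to tell Jsoup not to reject such responses outright and then check the Content-Type before parsing. A sketch of how the fetch inside getLinks could guard the parse, assuming Jsoup's ignoreContentType and Response.contentType (and note that UnsupportedMimeTypeException is an IOException, so the existing catch block already swallows it per URL):

    Connection.Response res = Jsoup.connect(url)
            .userAgent(this.USER_AGENT)
            .ignoreContentType(true)  // don't throw on non-HTML responses
            .execute();

    String contentType = res.contentType();
    // Only parse responses that are actually HTML or XML; skip zips,
    // images, and other binary content
    if (contentType != null && (contentType.startsWith("text/")
            || contentType.contains("xml"))){
        Document doc = res.parse();
        for (Element element : doc.select("a")){
            links.add(element.absUrl("href"));
        }
    }
    return links;

Whether the crawler should instead skip such URLs before connecting at all (for example by file extension) is a separate design choice.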
