Determining the nesting level of pages on a website

Time: 2017-07-24 12:52:46

Tags: java url java-ee jsoup

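I need to determine the nesting level of each page, that is, how many clicks it is away from the home page. What is the right way to do this? I already know how to visit all the pages of the site recursively.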

The code looks like this:

public void getPageLinks(String URL) {
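    // `links` is assumed to be a Set<String> field of the enclosing class (not shown in the question).
    //4. Check if you have already crawled the URLs
    //(we are intentionally not checking for duplicate content in this example)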
    if (!links.contains(URL)) {
        try {
            //4. (i) If not add it to the index
            if (links.add(URL)) {
                System.out.println(URL);
            }

            //2. Fetch the HTML code
            Document document = Jsoup.connect(URL).get();

            //3. Parse the HTML to extract links to other URLs
            Elements linksOnPage = document.select("a[href]");

            //5. For each extracted URL... go back to Step 4.
            for (Element page : linksOnPage) {
                getPageLinks(page.attr("abs:href"));
            }
        } catch (IOException e) {
            System.err.println("For '" + URL + "': " + e.getMessage());
        }
    }
}

I only need to check whether a link points to an external site; if it does, there is no need to follow it.
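
One possible approach, offered as a minimal sketch and not a definitive answer: replace the depth-first recursion with a breadth-first traversal that stores a depth for each URL, so the first time a page is reached is also its smallest click distance from the home page. External links are skipped by comparing hosts with `java.net.URI`. The names `PageDepth`, `crawlDepths`, `sameHost`, `fetchLinks` and `maxDepth` are hypothetical and do not come from the question. Link extraction is passed in as a function so the sketch runs without network access; with jsoup it could presumably wrap the same `Jsoup.connect(url).get()` and `select("a[href]")` / `attr("abs:href")` calls used above.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

public class PageDepth {

    // True when both URLs parse and share the same host (case-insensitive).
    // Assumption: "external" means "different host"; subdomains count as external here.
    static boolean sameHost(String a, String b) {
        try {
            String hostA = URI.create(a).getHost();
            String hostB = URI.create(b).getHost();
            return hostA != null && hostA.equalsIgnoreCase(hostB);
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: treat it as external and skip it
        }
    }

    // Breadth-first crawl from `home`. Returns each internal URL mapped to its
    // click depth (home page = 0). `fetchLinks` is a hypothetical hook that
    // returns the absolute link targets found on a page.
    static Map<String, Integer> crawlDepths(String home,
                                            Function<String, List<String>> fetchLinks,
                                            int maxDepth) {
        Map<String, Integer> depths = new LinkedHashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        depths.put(home, 0);
        queue.add(home);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            int depth = depths.get(url);
            if (depth >= maxDepth) {
                continue; // do not expand pages at the depth limit
            }
            for (String link : fetchLinks.apply(url)) {
                // Skip null links, external links, and pages already assigned a depth.
                if (link != null && sameHost(home, link) && !depths.containsKey(link)) {
                    depths.put(link, depth + 1);
                    queue.add(link);
                }
            }
        }
        return depths;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a real site, for illustration only.
        Map<String, List<String>> site = new HashMap<>();
        site.put("http://example.com/",
                Arrays.asList("http://example.com/a", "http://other.org/x"));
        site.put("http://example.com/a",
                Arrays.asList("http://example.com/b", "http://example.com/"));

        Map<String, Integer> depths = crawlDepths("http://example.com/",
                url -> site.getOrDefault(url, Collections.emptyList()), 5);
        depths.forEach((url, depth) -> System.out.println(depth + " " + url));
    }
}
```

In this toy site the home page gets depth 0, `/a` gets depth 1 and `/b` gets depth 2, while the link to `other.org` is never followed. A depth-first recursion like the one in the question can reach a page through a long path first, which is why the sketch uses a queue instead.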

0 answers:

No answers yet