Question

When scraping a page with jSoup, all the links on the page can be collected using:

Elements allLinksOnPage = doc.select("a");

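Which is great. Now, how would one go about removing duplicate URLs from this list? For example, imagine a /contact-us.html that is linked from the main navigation on every page.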

Once all the duplicate URLs have been removed, the next step is to crawl those unique URLs and keep the loop going.

Now for the actual question. Given this code:

for (Element e : allLinksOnPage) {
    String absUrl = e.absUrl("href");

    //Absolute URL starts with HTTP or HTTPS to make sure it isn't a mailto: link
    if (absUrl.startsWith("http") || absUrl.startsWith("https")) {
        //Check that the URL starts with the original domain name
        if (absUrl.startsWith(getURL)) {
            //Remove Duplicate URLs
            //Not sure how to do this bit yet?
            //Add New URLs found on Page to 'allLinksOnPage' to allow this 
            //For Loop to continue until the entire website has been scraped
        }
    }
}

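So the question concerns the last part of the loop: imagine that while page-2.html is being crawled, more URLs are identified there and added to the allLinksOnPage variable.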

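Will the for loop continue for the length of the full list, i.e. 10 links found on page-1.html plus 10 links found on page-2.html, so that 20 pages are crawled in total? Or will the loop only continue for the length of the first 10 links identified, i.e. the links that existed before the 'for (Element e : allLinksOnPage)' statement was triggered?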

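This will all inevitably end up in a database once the logic is finalised, but for now I want to keep the logic purely Java-based, to avoid a large number of reads and writes to the DB that would slow everything down.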

Answer 1

allLinksOnPage is only iterated once. You never retrieve any information about the pages you found links to.
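To illustrate what happens when a list grows while it is being looped over, here is a small sketch that uses a plain ArrayList<String> as a stand-in for the jsoup Elements list (which, to my knowledge, is itself an ArrayList subclass). The class name, method names and sample paths are made up for the illustration. A for-each loop does not quietly pick up the new entries: its iterator fails fast with a ConcurrentModificationException. An index-based loop, by contrast, re-reads size() on every pass and therefore does visit the appended entries.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

class GrowingListDemo {

    // For-each over an ArrayList while appending to it: the iterator fails fast.
    static boolean forEachFailsWhenListGrows() {
        List<String> links = new ArrayList<>(List.of("/page-1.html", "/page-2.html"));
        try {
            for (String link : links) {
                links.add(link + "?discovered"); // structural modification mid-iteration
            }
            return false;
        } catch (ConcurrentModificationException ex) {
            return true;
        }
    }

    // Index-based loop: size() is checked again on each pass,
    // so entries appended during the loop are visited as well.
    static int indexLoopVisits() {
        List<String> links = new ArrayList<>(List.of("/page-1.html"));
        int visited = 0;
        for (int i = 0; i < links.size(); i++) {
            visited++;
            if (links.get(i).equals("/page-1.html")) {
                links.add("/page-2.html");
                links.add("/page-3.html");
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(forEachFailsWhenListGrows());
        System.out.println(indexLoopVisits());
    }
}
```

Either way, appending to the list you are iterating is fragile, which is one reason to keep the pending URLs in a separate collection.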

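However, you can use a Set together with a List for this: the Set remembers every URL that has already been queued, and the List holds the URLs still waiting to be crawled. In addition, you can use the URL class to extract the protocol for you.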

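// Needs java.net.URL, java.util.*, org.jsoup.Jsoup, org.jsoup.nodes.Document,
// org.jsoup.nodes.Element and org.jsoup.select.Elements.
// Jsoup.parse and new URL(...) can throw IOException / MalformedURLException.
URL startUrl = ...;
Set<String> addedPages = new HashSet<>();   // every URL that has ever been queued
List<URL> urls = new ArrayList<>();         // URLs still waiting to be crawled
addedPages.add(startUrl.toExternalForm());
urls.add(startUrl);
while (!urls.isEmpty()) {
    // retrieve a URL that has not been crawled yet
    URL url = urls.remove(urls.size() - 1);

    Document doc = Jsoup.parse(url, TIMEOUT);
    Elements allLinksOnPage = doc.select("a");
    for (Element e : allLinksOnPage) {
        String href = e.absUrl("href");
        if (href.isEmpty()) {
            // absUrl returns "" when the href is missing or cannot be made absolute
            continue;
        }
        URL absUrl = new URL(href);
        String external = absUrl.toExternalForm();

        switch (absUrl.getProtocol()) {
            case "https":
            case "http":
                // Set.add returns false for a URL that was already queued
                if (external.startsWith(getURL) && addedPages.add(external)) {
                    urls.add(absUrl);
                }
                break;
            default:
                // ignore mailto: and other non-HTTP links
                break;
        }
    }
}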
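
The same Set-plus-worklist idea can be tried without any network access. The following is only a sketch under stated assumptions: FAKE_SITE is a hypothetical in-memory map standing in for "fetch the page with jsoup and select its links", and the class name, the example.com URLs and the crawl method are invented for the illustration. It also uses a FIFO queue (breadth-first), whereas the code above takes the most recently added URL first.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WorklistCrawlSketch {

    // Hypothetical stand-in for the real site: page URL -> links found on that page.
    static final Map<String, List<String>> FAKE_SITE = Map.of(
        "https://example.com/", List.of(
            "https://example.com/contact-us.html",
            "https://example.com/page-2.html",
            "https://example.com/contact-us.html"),   // duplicate from the navigation
        "https://example.com/page-2.html", List.of(
            "https://example.com/",
            "https://example.com/contact-us.html",
            "https://other.example.org/",             // different domain
            "mailto:info@example.com"),               // not a web page
        "https://example.com/contact-us.html", List.of(
            "https://example.com/"));

    // Returns every URL under 'prefix' reachable from 'start', each exactly once.
    static Set<String> crawl(String start, String prefix) {
        Set<String> seen = new LinkedHashSet<>();   // URLs already queued, in discovery order
        Deque<String> pending = new ArrayDeque<>(); // URLs still to be visited
        seen.add(start);
        pending.add(start);
        while (!pending.isEmpty()) {
            String page = pending.poll();
            for (String link : FAKE_SITE.getOrDefault(page, List.of())) {
                // Set.add returns false for duplicates, so each URL is queued once
                if (link.startsWith(prefix) && seen.add(link)) {
                    pending.add(link);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(crawl("https://example.com/", "https://example.com/"));
    }
}
```

Because the loop runs until the pending queue is empty rather than over a fixed list, links discovered on later pages are crawled too, and the Set guarantees the loop terminates even when pages link back to each other.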
