I have written a simple web crawler that recursively fetches URL links from a web page.
Now I am trying to find a way to limit the crawler by depth, but I do not know how to restrict it to a specific depth (I can limit the crawler to the first N links, but I want to limit it by depth).
For example, depth 2 should fetch the parent's links -> the children's links -> the children's children's links.
Any input is appreciated.
import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SimpleCrawler {

    // all URLs that have already been processed
    static Map<String, String> retMap = new ConcurrentHashMap<String, String>();

    public static void main(String[] args) throws IOException {
        StringBuffer sb = new StringBuffer();
        Map<String, String> map = returnURL("http://www.google.com");
        recursiveCrawl(map);
        for (Map.Entry<String, String> entry : retMap.entrySet()) {
            sb.append(entry.getKey());
        }
    }

    public static void recursiveCrawl(Map<String, String> map)
            throws IOException {
        for (Map.Entry<String, String> entry : map.entrySet()) {
            String key = entry.getKey();
            Map<String, String> recurSive = returnURL(key);
            recursiveCrawl(recurSive);
        }
    }

    // fetches a page and returns the links found on it
    public synchronized static Map<String, String> returnURL(String URL)
            throws IOException {
        Map<String, String> tempMap = new HashMap<String, String>();
        Document doc = null;
        if (URL != null && !URL.equals("") && !retMap.containsKey(URL)) {
            System.out.println("Processing==>" + URL);
            try {
                URL url = new URL(URL);
                System.setProperty("http.proxyHost", "proxy");
                System.setProperty("http.proxyPort", "port");
                doc = Jsoup.connect(URL).get();
                if (doc != null) {
                    Elements links = doc.select("a");
                    String finalString = "";
                    for (Element e : links) {
                        finalString = "http:" + e.attr("href");
                        if (!retMap.containsKey(finalString)) {
                            tempMap.put(finalString, finalString);
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            retMap.put(URL, URL);
        } else {
            System.out.println("****Skipping URL****" + URL);
        }
        return tempMap;
    }
}
EDIT1:
I thought about using a worklist, so I modified the code. I am still not sure how to set the depth here (I can set the number of web pages to crawl, but not exactly the depth). Any suggestions would be appreciated.
public void startCrawl(String url) {
    while (this.pagesVisited.size() < 2) {
        String currentUrl;
        SpiderLeg leg = new SpiderLeg();
        if (this.pagesToVisit.isEmpty()) {
            currentUrl = url;
            this.pagesVisited.add(url);
        } else {
            currentUrl = this.nextUrl();
        }
        leg.crawl(currentUrl);
        System.out.println("pagesToVisit Size" + pagesToVisit.size());
        // SpiderLeg
        this.pagesToVisit.addAll(leg.getLinks());
    }
    System.out.println("\n**Done** Visited " + this.pagesVisited.size()
            + " web page(s)");
}
Answer 0 (score: 2)
A non-recursive approach: keep a worklist pagesToCrawl of CrawlURL entries:
class CrawlURL {
    public String url;
    public int depth;

    public CrawlURL(String url, int depth) {
        this.url = url;
        this.depth = depth;
    }
}
Initially (before entering the loop):
Queue<CrawlURL> pagesToCrawl = new LinkedList<>();
pagesToCrawl.add(new CrawlURL(rootUrl, 0)); //rootUrl is the url to start from
Now the loop:
while (!pagesToCrawl.isEmpty()) { // will proceed at least once (for rootUrl)
    CrawlURL currentUrl = pagesToCrawl.remove();
    // analyze the url
    // update the worklist with the crawled links
}
And the update with the crawled links:
if (currentUrl.depth < 2) {
    for (String url : leg.getLinks()) { // referring to your analysis result
        pagesToCrawl.add(new CrawlURL(url, currentUrl.depth + 1));
    }
}
You can enhance CrawlURL with additional metadata, e.g. link name, referrer, etc.
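Putting those snippets together, a minimal sketch of the whole loop could look like the following (CrawlURL is the class from above; the visited set, the MAX_DEPTH constant and the fetchLinks helper are assumptions here, with fetchLinks standing in for whatever extracts the links of a page, e.g. your Jsoup-based returnURL or SpiderLeg.getLinks):

import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class DepthLimitedCrawler {

    static final int MAX_DEPTH = 2; // parent -> children -> children's children

    public static void crawl(String rootUrl) {
        Queue<CrawlURL> pagesToCrawl = new LinkedList<CrawlURL>();
        Set<String> visited = new HashSet<String>(); // avoid fetching a page twice
        pagesToCrawl.add(new CrawlURL(rootUrl, 0));

        while (!pagesToCrawl.isEmpty()) {
            CrawlURL current = pagesToCrawl.remove();
            if (!visited.add(current.url)) {
                continue; // already processed
            }
            List<String> links = fetchLinks(current.url); // placeholder for your page analysis
            if (current.depth < MAX_DEPTH) {
                for (String link : links) {
                    pagesToCrawl.add(new CrawlURL(link, current.depth + 1));
                }
            }
        }
    }

    // placeholder: plug in your Jsoup code (or SpiderLeg) here
    private static List<String> fetchLinks(String url) {
        return new LinkedList<String>();
    }
}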
Alternative:
In my comment above I mentioned a generation-based approach. It is a bit more complex than this one. The basic idea is to keep two lists (currentPagesToCrawl and futurePagesToCrawl) together with a generation variable (starting at 0 and incremented every time currentPagesToCrawl becomes empty). All crawled URLs are put into the futurePagesToCrawl queue, and the two lists are switched whenever currentPagesToCrawl is empty. This is done until the generation variable reaches 2.
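A rough sketch of that alternative, under the assumption that fetchLinks again stands in for your link extraction (e.g. the Jsoup code from the question) and that links are no longer expanded once generation 2 is reached:

// generation-based sketch: two queues plus a generation counter
public static void generationCrawl(String rootUrl) {
    Queue<String> currentPagesToCrawl = new LinkedList<String>();
    Queue<String> futurePagesToCrawl = new LinkedList<String>();
    int generation = 0;

    currentPagesToCrawl.add(rootUrl);
    while (!currentPagesToCrawl.isEmpty()) {
        String url = currentPagesToCrawl.remove();
        futurePagesToCrawl.addAll(fetchLinks(url)); // all crawled links go to the future queue

        if (currentPagesToCrawl.isEmpty() && generation < 2) {
            // current generation exhausted: switch the two lists
            Queue<String> tmp = currentPagesToCrawl;
            currentPagesToCrawl = futurePagesToCrawl;
            futurePagesToCrawl = tmp; // now empty, collects the next generation
            generation++;
        }
    }
}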
Answer 1 (score: 1)
You can add a depth parameter to the signature of the recursive method, e.g. in your main:
recursiveCrawl(map, 0);
and:
public static void recursiveCrawl(Map<String, String> map, int depth) throws IOException {
    if (depth++ < DESIRED_DEPTH) // assuming initial depth = 0
        for (Map.Entry<String, String> entry : map.entrySet()) {
            String key = entry.getKey();
            Map<String, String> recurSive = returnURL(key);
            recursiveCrawl(recurSive, depth);
        }
}
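Here DESIRED_DEPTH is assumed to be a constant you define yourself, for example:

private static final int DESIRED_DEPTH = 2; // parent -> children -> children's children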
Answer 2 (score: 0)
You can do it like this:
static int maxLevels = 10;

public static void main(String args[]) throws IOException {
    ...
    recursiveCrawl(map, 0);
    ...
}

public static void recursiveCrawl(Map<String, String> map, int level) throws IOException {
    for (Map.Entry<String, String> entry : map.entrySet()) {
        String key = entry.getKey();
        Map<String, String> recurSive = returnURL(key);
        if (level < maxLevels) {
            recursiveCrawl(recurSive, level + 1); // level + 1 (not ++level) so siblings stay on the same level
        }
    }
}
Also, you could use a Set instead of a Map.
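For example, the visited bookkeeping could be a sketch like this (Collections.newSetFromMap backed by a ConcurrentHashMap keeps the Set thread-safe; imports needed: java.util.Collections, java.util.Set, java.util.concurrent.ConcurrentHashMap):

// thread-safe Set replacing the Map with identical key and value
static Set<String> visitedUrls =
        Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

public static void visit(String url) {
    // add() returns false if the URL was already present, so it replaces
    // the containsKey()/put() pair in returnURL()
    if (visitedUrls.add(url)) {
        System.out.println("Processing==>" + url);
        // ... fetch the page with Jsoup and collect its links, as before
    }
}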