How to set the depth of a simple Java web crawler

Time: 2015-12-15 18:38:12

Tags: java web-crawler jsoup

I wrote a simple web crawler that recursively fetches URL links from a web page.

Now I want to find a way to limit the crawler by depth, but I don't know how to restrict it to a specific depth (I can limit it to the first N links, but that's not the same as limiting by depth).

For example: depth 2 should fetch the parent page's links -> each child page's links -> each of those children's links.

Any input is appreciated.

import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SimpleCrawler {

    // URLs that have already been processed, shared across calls
    static Map<String, String> retMap = new ConcurrentHashMap<String, String>();

    public static void main(String args[]) throws IOException {
        StringBuffer sb = new StringBuffer();
        Map<String, String> map = returnURL("http://www.google.com");
        recursiveCrawl(map);
        for (Map.Entry<String, String> entry : retMap.entrySet()) {
            sb.append(entry.getKey());
        }
    }

    public static void recursiveCrawl(Map<String, String> map)
            throws IOException {
        for (Map.Entry<String, String> entry : map.entrySet()) {
            String key = entry.getKey();
            Map<String, String> recurSive = returnURL(key);
            recursiveCrawl(recurSive);
        }
    }

    public synchronized static Map<String, String> returnURL(String URL)
            throws IOException {

        Map<String, String> tempMap = new HashMap<String, String>();
        Document doc = null;
        if (URL != null && !URL.equals("") && !retMap.containsKey(URL)) {
            System.out.println("Processing==>" + URL);
            try {
                URL url = new URL(URL); // validates the URL string
                System.setProperty("http.proxyHost", "proxy"); // placeholder proxy settings
                System.setProperty("http.proxyPort", "port");
                doc = Jsoup.connect(URL).get();
                if (doc != null) {
                    Elements links = doc.select("a");
                    String FinalString = "";
                    for (Element e : links) {
                        // NB: only correct for protocol-relative hrefs ("//host/path")
                        FinalString = "http:" + e.attr("href");
                        if (!retMap.containsKey(FinalString)) {
                            tempMap.put(FinalString, FinalString);
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            retMap.put(URL, URL);
        } else {
            System.out.println("****Skipping URL****" + URL);
        }
        return tempMap;
    }

}

EDIT1:

I thought of using a worklist, so I modified the code. I'm still not sure how to set the depth here (I can set the number of pages to crawl, but that isn't exactly depth). Any suggestions would be appreciated.

public void startCrawl(String url) {
    // NOTE: this bounds the number of pages visited, not the depth
    while (this.pagesVisited.size() < 2) {
        String currentUrl;
        SpiderLeg leg = new SpiderLeg();
        if (this.pagesToVisit.isEmpty()) {
            currentUrl = url;
            this.pagesVisited.add(url);
        } else {
            currentUrl = this.nextUrl();
        }
        leg.crawl(currentUrl);
        System.out.println("pagesToVisit Size" + pagesToVisit.size());
        // enqueue the links found by the SpiderLeg
        this.pagesToVisit.addAll(leg.getLinks());
    }
    System.out.println("\n**Done** Visited " + this.pagesVisited.size()
            + " web page(s)");
}

3 Answers:

Answer 0 (score: 2)

Based on a non-recursive approach:

Keep a worklist pagesToCrawl of URLs of type CrawlURL:
class CrawlURL {
  public String url;
  public int depth;

  public CrawlURL(String url, int depth) {
    this.url = url;
    this.depth = depth;
  }
}

Initially (before entering the loop):

Queue<CrawlURL> pagesToCrawl = new LinkedList<>();
pagesToCrawl.add(new CrawlURL(rootUrl, 0)); //rootUrl is the url to start from

Now the loop:

while (!pagesToCrawl.isEmpty()) { // will proceed at least once (for rootUrl)
  CrawlURL currentUrl = pagesToCrawl.remove();
  // analyze the URL here...
  // ...then update the worklist with the crawled links (see below)
}

And update it with the links:

if (currentUrl.depth < 2) {
  for (String url : leg.getLinks()) { // referring to your analysis result
    pagesToCrawl.add(new CrawlURL(url, currentUrl.depth + 1));
  }
}
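
Putting the pieces together, a minimal sketch might look like this. CrawlURL is the class above; SpiderLeg, leg.crawl() and leg.getLinks() are taken from the EDIT1 code in the question; the visited set is an extra guard (not in the fragments above) to avoid recrawling a URL:

// Needs java.util.{Queue, LinkedList, Set, HashSet}
Queue<CrawlURL> pagesToCrawl = new LinkedList<CrawlURL>();
Set<String> visited = new HashSet<String>();
pagesToCrawl.add(new CrawlURL(rootUrl, 0));

while (!pagesToCrawl.isEmpty()) {
  CrawlURL currentUrl = pagesToCrawl.remove();
  if (!visited.add(currentUrl.url)) {
    continue; // already crawled this URL
  }
  SpiderLeg leg = new SpiderLeg();
  leg.crawl(currentUrl.url); // analyze the page
  if (currentUrl.depth < 2) { // do not expand links below depth 2
    for (String url : leg.getLinks()) {
      pagesToCrawl.add(new CrawlURL(url, currentUrl.depth + 1));
    }
  }
}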

You could enhance CrawlURL with additional metadata (e.g. the link name, referrer, etc.).
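
For instance (the extra fields and their names are purely illustrative):

class CrawlURL {
  public String url;
  public int depth;
  public String linkText; // anchor text of the link that led here
  public String referrer; // URL of the page the link was found on

  public CrawlURL(String url, int depth, String linkText, String referrer) {
    this.url = url;
    this.depth = depth;
    this.linkText = linkText;
    this.referrer = referrer;
  }
}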

Alternative: in my comment above I mentioned a generation-based approach. It's a bit more complex than this one. The basic idea is to keep two lists (currentPagesToCrawl and futurePagesToCrawl) together with a generation variable (starting at 0 and incremented every time currentPagesToCrawl becomes empty). All newly crawled URLs are placed in the futurePagesToCrawl queue, and the two lists are swapped whenever currentPagesToCrawl becomes empty. This is done until the generation variable reaches 2.
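
A rough sketch of that idea. Two things here are my interpretation rather than part of the answer: extractLinks(url) is a hypothetical fetch-and-extract helper standing in for whatever crawling routine you already have, and the <= bound lets pages at the final depth still be visited; duplicate filtering is omitted for brevity:

// Needs java.util.{Queue, LinkedList}
int maxGeneration = 2; // the desired depth
int generation = 0;
Queue<String> currentPagesToCrawl = new LinkedList<String>();
Queue<String> futurePagesToCrawl = new LinkedList<String>();

currentPagesToCrawl.add(rootUrl);
while (generation <= maxGeneration && !currentPagesToCrawl.isEmpty()) {
  String url = currentPagesToCrawl.remove();
  // crawl the page; its links belong to the next generation
  futurePagesToCrawl.addAll(extractLinks(url));
  if (currentPagesToCrawl.isEmpty()) {
    // this generation is exhausted: swap the lists, go one level deeper
    Queue<String> tmp = currentPagesToCrawl;
    currentPagesToCrawl = futurePagesToCrawl;
    futurePagesToCrawl = tmp;
    generation++;
  }
}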

Answer 1 (score: 1)

You can add a depth parameter to the signature of your recursive method, e.g.

in your main:

recursiveCrawl(map, 0);

public static void recursiveCrawl(Map<String, String> map, int depth)
        throws IOException {
    // DESIRED_DEPTH is a constant you define (e.g. 2)
    if (depth++ < DESIRED_DEPTH) { // assuming initial depth = 0
        for (Map.Entry<String, String> entry : map.entrySet()) {
            String key = entry.getKey();
            Map<String, String> recurSive = returnURL(key);
            recursiveCrawl(recurSive, depth);
        }
    }
}

Answer 2 (score: 0)

You can do it like this:

static int maxLevels = 10;

public static void main(String args[]) throws IOException {
    ...
    recursiveCrawl(map, 0);
    ...
}

public static void recursiveCrawl(Map<String, String> map, int level)
        throws IOException {
    for (Map.Entry<String, String> entry : map.entrySet()) {
        String key = entry.getKey();
        Map<String, String> recurSive = returnURL(key);
        if (level < maxLevels) {
            // pass level + 1 rather than ++level, so that sibling links
            // at this level all recurse with the same depth
            recursiveCrawl(recurSive, level + 1);
        }
    }
}

Also, you could use a Set instead of a Map.
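
For example, a minimal sketch of the Set variant. A Set suffices because the original map always stored the same URL as both key and value. Note two deliberate changes from the question's code, dropped here for brevity: abs:href (Jsoup's way of resolving relative links) replaces "http:" + href, and the proxy setup and error handling are omitted:

// Needs java.util.{Collections, HashSet, Set}, java.util.concurrent.ConcurrentHashMap,
// and org.jsoup.{Jsoup, nodes.Document, nodes.Element}
static Set<String> retSet =
        Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

public static Set<String> returnURL(String url) throws IOException {
    Set<String> tempSet = new HashSet<String>();
    if (url != null && !url.isEmpty() && !retSet.contains(url)) {
        Document doc = Jsoup.connect(url).get();
        for (Element e : doc.select("a")) {
            String link = e.attr("abs:href"); // resolves relative hrefs against the page URL
            if (!link.isEmpty() && !retSet.contains(link)) {
                tempSet.add(link);
            }
        }
        retSet.add(url);
    }
    return tempSet;
}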