WebCrawler with recursion

时间:2014-03-03 08:26:45

标签: java recursion web-crawler

So I'm working on a web crawler that is supposed to download all of the images, files, and web pages it finds, and then recursively do the same for every web page it has found. However, I seem to have a logic error somewhere.

    import java.io.File;
    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.HashMap;

    public class WebCrawler {

        private static String url;
        private static int maxCrawlDepth;
        private static String filePath;

        /* Recursive function that crawls all web pages found on a given web page.
         * This function also saves elements from the DownloadRepository to disk.
         */
        public static void crawling(WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {

            webpage.crawl(currentCrawlDepth);

            HashMap<String, WebPage> pages = webpage.getCrawledWebPages();

            if (currentCrawlDepth < maxCrawlDepth) {
                for (WebPage wp : pages.values()) {
                    crawling(wp, currentCrawlDepth + 1, maxCrawlDepth);
                }
            }
        }

        public static void main(String[] args) {

            if (args.length != 3) {
                System.out.println("Must pass three parameters");
                System.exit(0);
            }

            url = "";
            maxCrawlDepth = 0;
            filePath = "";

            url = args[0];
            try {
                URL testUrl = new URL(url);
                URLConnection urlConnection = testUrl.openConnection();
                urlConnection.connect();
            } catch (MalformedURLException e) {
                System.out.println("Not a valid URL");
                System.exit(0);
            } catch (IOException e) {
                System.out.println("Could not open URL");
                System.exit(0);
            }

            try {
                maxCrawlDepth = Integer.parseInt(args[1]);
            } catch (NumberFormatException e) {
                System.out.println("Argument is not an int");
                System.exit(0);
            }

            filePath = args[2];
            File path = new File(filePath);
            if (!path.exists()) {
                System.out.println("File Path is invalid");
                System.exit(0);
            }

            WebPage webpage = new WebPage(url);
            crawling(webpage, 0, maxCrawlDepth);

            System.out.println("Web crawl is complete");
        }

    }

The crawl function parses the content of a web site and stores any images, files, or links it finds into hash maps, for example:

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.util.HashMap;

    import org.jsoup.Connection.Response;
    import org.jsoup.Jsoup;
    import org.jsoup.helper.HttpConnection;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class WebPage implements WebElement {

        private static Elements images;
        private static Elements links;

        private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();
        private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();
        private HashMap<String, WebFile> files = new HashMap<String, WebFile>();

        private String url;

        public WebPage(String url) {
            this.url = url;
        }

        /* The crawl method parses the html on a given web page
         * and adds the elements of the web page to the Download
         * Repository.
         */
        public void crawl(int currentCrawlDepth) {

            System.out.print("Crawling " + url + " at crawl depth ");
            System.out.println(currentCrawlDepth + "\n");

            Document doc = null;

            try {
                HttpConnection httpConnection = (HttpConnection) Jsoup.connect(url);
                httpConnection.ignoreContentType(true);
                doc = httpConnection.get();
            } catch (MalformedURLException e) {
                System.out.println(e.getLocalizedMessage());
            } catch (IOException e) {
                System.out.println(e.getLocalizedMessage());
            } catch (IllegalArgumentException e) {
                System.out.println(url + " is not a valid URL");
            }

            DownloadRepository downloadRepository = DownloadRepository.getInstance();

            if (doc != null) {
                images = doc.select("img");
                links = doc.select("a[href]");

                for (Element image : images) {
                    String imageUrl = image.absUrl("src");
                    if (!webImages.containsValue(image)) {
                        WebImage webImage = new WebImage(imageUrl);
                        webImages.put(imageUrl, webImage);
                        downloadRepository.addElement(imageUrl, webImage);
                        System.out.println("Added image at " + imageUrl);
                    }
                }

                HttpConnection mimeConnection = null;
                Response mimeResponse = null;

                for (Element link : links) {
                    String linkUrl = link.absUrl("href");
                    linkUrl = linkUrl.trim();
                    if (!linkUrl.contains("#")) {
                        try {
                            mimeConnection = (HttpConnection) Jsoup.connect(linkUrl);
                            mimeConnection.ignoreContentType(true);
                            mimeConnection.ignoreHttpErrors(true);
                            mimeResponse = (Response) mimeConnection.execute();
                        } catch (Exception e) {
                            System.out.println(e.getLocalizedMessage());
                        }

                        String contentType = null;
                        if (mimeResponse != null) {
                            contentType = mimeResponse.contentType();
                        }

                        if (contentType == null) {
                            continue;
                        }
                        if (contentType.equals("text/html")) {
                            if (!webPages.containsKey(linkUrl)) {
                                WebPage webPage = new WebPage(linkUrl);
                                webPages.put(linkUrl, webPage);
                                downloadRepository.addElement(linkUrl, webPage);
                                System.out.println("Added webPage at " + linkUrl);
                            }
                        } else {
                            if (!files.containsValue(link)) {
                                WebFile webFile = new WebFile(linkUrl);
                                files.put(linkUrl, webFile);
                                downloadRepository.addElement(linkUrl, webFile);
                                System.out.println("Added file at " + linkUrl);
                            }
                        }
                    }
                }
            }

            System.out.print("\nFinished crawling " + url + " at crawl depth ");
            System.out.println(currentCrawlDepth + "\n");
        }

        public HashMap<String, WebImage> getImages() {
            return webImages;
        }

        public HashMap<String, WebPage> getCrawledWebPages() {
            return webPages;
        }

        public HashMap<String, WebFile> getFiles() {
            return files;
        }

        public String getUrl() {
            return url;
        }

        @Override
        public void saveToDisk(String filePath) {
            System.out.println(filePath);
        }
    }

The point of the hash maps is to make sure that I never parse the same web site more than once. The error seems to be in my recursion. What is the problem?

Here is some sample output from a crawl that starts at http://www.google.com:

Crawling https://www.google.com/ at crawl depth 0

Added webPage at http://www.google.com/intl/en/options/
Added webPage at https://www.google.com/intl/en/ads/
Added webPage at https://www.google.com/services/
Added webPage at https://www.google.com/intl/en/about.html
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/ at crawl depth 0

Crawling https://www.google.com/services/ at crawl depth 1

Added webPage at http://www.google.com/intl/en/enterprise/apps/business/?utm_medium=et&utm_campaign=en&utm_source=us-en-et-nelson_bizsol
Added webPage at https://www.google.com/services/sitemap.html
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/services/ at crawl depth 1

**Crawling https://www.google.com/intl/en/policies/ at crawl depth 2**

Added webPage at https://www.google.com/intl/en/policies/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/privacy/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/faq/
Added webPage at https://www.google.com/intl/en/policies/technologies/
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/

Finished crawling https://www.google.com/intl/en/policies/ at crawl depth 2

**Crawling https://www.google.com/intl/en/policies/ at crawl depth 3**

Notice that it parses http://www.google.com/intl/en/policies/ twice.

1 answer:

Answer 0 (score: 1):

You are creating a new map for every web page. That ensures that if the same link appears twice on a single page it is only crawled once, but it does not handle the case where the same link appears on two different pages.

https://www.google.com/intl/en/policies/ appears on both https://www.google.com/ and https://www.google.com/services/.

To avoid this, use a single map for the entire crawl and pass it as a parameter into the recursion:

    import java.util.HashMap;
    import java.util.Map;

    public class WebCrawler {

        private static Map<String, WebPage> visited = new HashMap<String, WebPage>();

        public static void crawling(Map<String, WebPage> visited, WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {
            // same recursion as before, but record and check pages in the shared map
        }
    }
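
For example, the recursive step could record each page in the shared map and skip anything that is already in it. This is only a minimal sketch, assuming the crawl, getUrl, and getCrawledWebPages methods from the question; where exactly the check lives is up to you.

    // Minimal sketch: one map is shared by every level of the recursion, so a
    // URL that has been crawled anywhere in the crawl is never crawled again.
    public static void crawling(Map<String, WebPage> visited, WebPage webpage,
                                int currentCrawlDepth, int maxCrawlDepth) {

        // Record this page before crawling it, keyed by its URL.
        visited.put(webpage.getUrl(), webpage);

        webpage.crawl(currentCrawlDepth);

        if (currentCrawlDepth < maxCrawlDepth) {
            for (WebPage wp : webpage.getCrawledWebPages().values()) {
                // Only recurse into pages that no other branch has visited yet.
                if (!visited.containsKey(wp.getUrl())) {
                    crawling(visited, wp, currentCrawlDepth + 1, maxCrawlDepth);
                }
            }
        }
    }

main would then start the crawl with an empty map, for example crawling(new HashMap<String, WebPage>(), webpage, 0, maxCrawlDepth).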

Since you are also holding maps of images and so on, you may prefer to create a new object, perhaps call it Visited, and make it keep track of everything:

    import java.util.HashMap;

    public class Visited {

        private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();

        public boolean visit(String url, WebPage page) {
            if (webPages.containsKey(url)) {
                return false;
            }
            webPages.put(url, page);
            return true;
        }

        private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();

        public boolean visit(String url, WebImage image) {
            if (webImages.containsKey(url)) {
                return false;
            }
            webImages.put(url, image);
            return true;
        }

    }
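
With that in place, the duplicate from your sample output would be rejected on its second appearance. A small usage sketch, reusing the WebPage class from the question:

    // Usage sketch: the first visit() for a URL returns true, any repeat
    // returns false, so each page can be queued for download at most once.
    Visited visited = new Visited();
    String policies = "https://www.google.com/intl/en/policies/";

    boolean first  = visited.visit(policies, new WebPage(policies));  // true  -> crawl it
    boolean second = visited.visit(policies, new WebPage(policies));  // false -> already seen, skip it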