Question

为什么构建在crawler4j上的以下代码仅抓取给定的种子网址并且不会开始抓取其他链接？

public static void main( String[] args )
{
      String crawlStorageFolder = "F:\\crawl";
      int numberOfCrawlers = 7;

      CrawlConfig config = new CrawlConfig();
      config.setCrawlStorageFolder(crawlStorageFolder);
      config.setMaxDepthOfCrawling(4);
      /*
       * Instantiate the controller for this crawl.
       */
      PageFetcher pageFetcher = new PageFetcher(config);

      RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
      robotstxtConfig.setEnabled(false);

      RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
      CrawlController controller = null;
        try {
            controller = new CrawlController(config, pageFetcher, robotstxtServer);
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

      /*
       * For each crawl, you need to add some seed urls. These are the first
       * URLs that are fetched and then the crawler starts following links
       * which are found in these pages
       */
      controller.addSeed("http://edition.cnn.com/2016/05/11/politics/paul-ryan-donald-trump-meeting/index.html");        

      /*
       * Start the crawl. This is a blocking operation, meaning that your code
       * will reach the line after this only when crawling is finished.
       */
      controller.start(MyCrawler.class, numberOfCrawlers);

  }

Answer 1

官方示例仅限于www.ics.uci.edu域。因此，需要调整扩展shouldVisit类中的Crawler方法。

 /**
   * You should implement this function to specify whether the given url
   * should be crawled or not (based on your crawling logic).
   */
  @Override
  public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    // Ignore the url if it has an extension that matches our defined set of image extensions.
    if (IMAGE_EXTENSIONS.matcher(href).matches()) {
      return false;
    }

    // Only accept the url if it is in the "www.ics.uci.edu" domain and protocol is "http".
    return href.startsWith("http://www.ics.uci.edu/");
  }

crawler4j仅抓取种子URL

1 个答案: