Why does the following code, built on crawler4j, only fetch the given seed URL and never start crawling other links?
public static void main(String[] args) {
    String crawlStorageFolder = "F:\\crawl";
    int numberOfCrawlers = 7;

    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder(crawlStorageFolder);
    config.setMaxDepthOfCrawling(4);

    /*
     * Instantiate the controller for this crawl.
     */
    PageFetcher pageFetcher = new PageFetcher(config);
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    robotstxtConfig.setEnabled(false);
    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController controller = null;
    try {
        controller = new CrawlController(config, pageFetcher, robotstxtServer);
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    /*
     * For each crawl, you need to add some seed urls. These are the first
     * URLs that are fetched and then the crawler starts following links
     * which are found in these pages
     */
    controller.addSeed("http://edition.cnn.com/2016/05/11/politics/paul-ryan-donald-trump-meeting/index.html");

    /*
     * Start the crawl. This is a blocking operation, meaning that your code
     * will reach the line after this only when crawling is finished.
     */
    controller.start(MyCrawler.class, numberOfCrawlers);
}
Answer (score: 3)
The official example is restricted to the www.ics.uci.edu domain, so the shouldVisit method in the class that extends WebCrawler (MyCrawler in the question) needs to be adjusted. The official implementation looks like this:
/**
 * You should implement this function to specify whether the given url
 * should be crawled or not (based on your crawling logic).
 */
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    // Ignore the url if it has an extension that matches our defined set of image extensions.
    if (IMAGE_EXTENSIONS.matcher(href).matches()) {
        return false;
    }
    // Only accept the url if it is in the "www.ics.uci.edu" domain and protocol is "http".
    return href.startsWith("http://www.ics.uci.edu/");
}