Given this simple code:
CrawlConfig config = new CrawlConfig();
config.setMaxDepthOfCrawling(1);
config.setPolitenessDelay(1000);
config.setResumableCrawling(false);
config.setIncludeBinaryContentInCrawling(false);
config.setCrawlStorageFolder(Config.get(Config.CRAWLER_SHARED_DIR) + "test/");
config.setShutdownOnEmptyQueue(false);
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://localhost/test");
controller.startNonBlocking(WebCrawler.class, 1);
long counter = 1;
while (Thread.currentThread().isAlive()) {
    System.out.println(config.toString());
    for (int i = 0; i < 4; i++) {
        System.out.println("Adding link");
        controller.addSeed("http://localhost/test" + ++counter + "/");
    }
    try {
        TimeUnit.SECONDS.sleep(5);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}
The output of the program is:
18:48:02.411 [main] INFO - Obtained 6791 TLD from packaged file tld-names.txt
18:48:02.441 [main] INFO - Deleted contents of: /home/scraper/test/frontier ( as you have configured resumable crawling to false )
18:48:02.636 [main] INFO - Crawler 1 started
18:48:02.636 [Crawler 1] INFO - Crawler Crawler 1 started!
Adding link
Adding link
Adding link
Adding link
18:48:02.685 [Crawler 1] WARN - Skipping URL: http://localhost/test, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
18:48:03.642 [Crawler 1] WARN - Skipping URL: http://localhost/test2/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
18:48:04.642 [Crawler 1] WARN - Skipping URL: http://localhost/test3/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
18:48:05.643 [Crawler 1] WARN - Skipping URL: http://localhost/test4/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
18:48:06.642 [Crawler 1] WARN - Skipping URL: http://localhost/test5/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
Adding link
Adding link
Adding link
Adding link
Adding link
Adding link
Adding link
Adding link
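Why doesn't crawler4j visit test6, test7, and the later URLs?
As you can see, the first 4 links added at runtime are all added and visited correctly.
When I set "http://localhost/" as the seed URL (before starting the crawler), it processes up to 13 links and then the same problem occurs.
What I want to achieve is being able to add URLs to visit to a running crawler from another thread (at runtime).
@EDIT: Following @Seth's suggestion, I have looked at the thread dump, but I still can't figure out why it doesn't work.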
"Thread-1" #25 prio=5 os_prio=0 tid=0x00007ff32854b800 nid=0x56e3 waiting on condition [0x00007ff2de403000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at edu.uci.ics.crawler4j.crawler.CrawlController.sleep(CrawlController.java:367)
at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:243)
- locked <0x00000005959baff8> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- None
"Crawler 1" #24 prio=5 os_prio=0 tid=0x00007ff328544000 nid=0x56e2 in Object.wait() [0x00007ff2de504000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000596afdd28> (a java.lang.Object)
at java.lang.Object.wait(Object.java:502)
at edu.uci.ics.crawler4j.frontier.Frontier.getNextURLs(Frontier.java:151)
- locked <0x0000000596afdd28> (a java.lang.Object)
at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:259)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- None
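
For readers less familiar with the "WAITING (on object monitor)" state shown for "Crawler 1": a thread parked in Object.wait() only resumes when another thread calls notify() or notifyAll() on the same monitor (or interrupts it). The following is a minimal, self-contained sketch of that behavior. It is not crawler4j's code: DemoQueue, its add(url, notify) method, and consumerWakes are hypothetical names made up for illustration. Whether crawler4j's addSeed path issues such a notification cannot be read from the dump and would need to be checked against the source of Frontier.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class WaitNotifyDemo {

    /** Hypothetical stand-in for a URL work queue; NOT crawler4j's Frontier. */
    static class DemoQueue {
        private final Object waitingList = new Object();
        private final Queue<String> urls = new ArrayDeque<>();

        /** Adds a URL; wakes waiting consumers only when notify is true. */
        void add(String url, boolean notify) {
            synchronized (waitingList) {
                urls.add(url);
                if (notify) {
                    waitingList.notifyAll();
                }
            }
        }

        /** Blocks in Object.wait() until a URL is available, then returns it. */
        String take() throws InterruptedException {
            synchronized (waitingList) {
                while (urls.isEmpty()) {
                    waitingList.wait();
                }
                return urls.remove();
            }
        }
    }

    /** Returns true if a consumer blocked in take() finished within timeoutMillis. */
    static boolean consumerWakes(boolean notify, long timeoutMillis) throws InterruptedException {
        DemoQueue queue = new DemoQueue();
        Thread consumer = new Thread(() -> {
            try {
                queue.take();
            } catch (InterruptedException ignored) {
                // Interrupted during cleanup; just exit.
            }
        }, "Crawler 1");
        consumer.setDaemon(true);
        consumer.start();

        // Wait until the consumer is actually parked in Object.wait().
        while (consumer.getState() != Thread.State.WAITING) {
            Thread.sleep(10);
        }

        queue.add("http://localhost/test6/", notify);
        consumer.join(timeoutMillis);
        boolean woke = !consumer.isAlive();
        consumer.interrupt(); // Clean up a consumer that is still waiting.
        return woke;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("with notifyAll: " + consumerWakes(true, 1000));
        System.out.println("without notify: " + consumerWakes(false, 1000));
    }
}
```

Barring a rare spurious wakeup, the consumer should finish only in the first case. In the second case the URL sits in the queue while the consumer stays in the same WAITING state that the dump shows for "Crawler 1".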