I have been working on a web crawler for a while now. The idea is simple: I have an SQL table containing a list of websites, and I have many threads that each fetch the first website from the table, delete it, and then crawl it (in a similar fashion).
The code is a bit too long, so I will try to cut some parts out of it:
while (true) {
    if (!stopped) {
        System.gc();
        Statement stmt;
        String scanned = "scanned";
        if (!scan) scanned = "crawled";
        Connection connection = null;
        try {
            connection = Utils.getConnection();
        } catch (Exception e1) {
            // connection is still null here, so calling connection.close()
            // in this catch block would throw a NullPointerException
            e1.printStackTrace();
        }
        String name;
        stmt = connection.createStatement();
        ResultSet rs = null;
        boolean next;
        do {
            rs = stmt.executeQuery("select url from websites where " + scanned + " = -1");
            next = rs.next();
        } while (next && Utils.inBlackList(rs.getString(1)));
        if (next) {
            name = rs.getString(1);
            stmt.executeUpdate("UPDATE websites SET " + scanned + " = 1 where url = '"
                    + Utils.stripDomainName(name) + "'");
            String backup_name = name;
            name = Utils.checkUrl(name);
            System.out.println(scanned + " of the website : " + name
                    + " just started by the Thread : " + num);
            // And here is the important part, I think
            CrawlConfig config = new CrawlConfig();
            String ts = Utils.getTime();
            SecureRandom random = new SecureRandom();
            String SessionId = new BigInteger(130, random).toString(32);
            String crawlStorageFolder = "tmp/temp_storageadmin" + SessionId;
            config.setCrawlStorageFolder(crawlStorageFolder);
            config.setPolitenessDelay(Main.POLITENESS_DELAY);
            config.setMaxDepthOfCrawling(Main.MAX_DEPTH_OF_CRAWLING);
            config.setMaxPagesToFetch(Main.MAX_PAGES_TO_FETCH);
            config.setResumableCrawling(Main.RESUMABLE_CRAWLING);
            int numberOfCrawlers = Main.NUMBER_OF_CRAWLERS;
            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            try {
                controller = new CrawlerController(config, pageFetcher, robotstxtServer);
                controller.addSeed(name);
                controller.setSeeed(name);
                controller.setTimestamp(ts);
                controller.setSessiiid("admin" + num + scan);
                //Main.crawls.addCrawl("admin"+num+scan, new Crawl(name,"admin"+num+scan,ts));
                stmt.executeUpdate("DELETE FROM tempCrawl WHERE SessionID = '"
                        + "admin" + num + scan + "'");
                if (!scan) {
                    // Main.crawls.getCrawl("admin"+num+scan).setCrawl(true);
                    stmt.executeUpdate("INSERT INTO tempCrawl (SessionID, url, ts, done, crawledpages, misspelled, english, proper, scan, crawl)"
                            + " VALUES ('" + "admin" + num + scan + "', '" + name + "', '" + ts
                            + "', false, 0, 0, true, false, " + false + ", " + true + ")");
                } else {
                    // Main.crawls.getCrawl("admin"+num+scan).setScan(true);
                    stmt.executeUpdate("INSERT INTO tempCrawl (SessionID, url, ts, done, crawledpages, misspelled, english, proper, scan, crawl)"
                            + " VALUES ('" + "admin" + num + scan + "', '" + name + "', '" + ts
                            + "', false, 0, 0, true, false, " + true + ", " + false + ")");
                }
                connection.close();
                controller.start_auto(Crawler.class, numberOfCrawlers, false, scan, num);
            } catch (Exception e) {
                rs.close();
                connection.close();
                e.printStackTrace();
            }
        } else {
            rs.close();
            connection.close();
        }
        //CrawlerController.start_auto(scan, num);
        if (stopping) {
            stopped = true;
            stopping = false;
        }
    }
}
} catch (Exception e) {   // catch for an enclosing try block (elided above)
    e.printStackTrace();
}
As you can see, every iteration creates a crawlerController, crawls one website, and so on.
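Independently of the crawler itself, the loop above opens a JDBC Connection, Statement and ResultSet on every iteration but only closes some of them on some paths, so an exception or an early branch leaks them. A minimal sketch (not the original code, and using stub classes instead of a real database) of how try-with-resources guarantees cleanup on every path:

```java
import java.util.ArrayList;
import java.util.List;

public class ResourceDemo {
    // Records which resources were closed, in order.
    static List<String> closed = new ArrayList<>();

    // Stand-ins for java.sql.Connection / java.sql.Statement.
    static class StubConnection implements AutoCloseable {
        @Override public void close() { closed.add("connection"); }
    }
    static class StubStatement implements AutoCloseable {
        void executeQuery() { throw new RuntimeException("query failed"); }
        @Override public void close() { closed.add("statement"); }
    }

    public static void main(String[] args) {
        try (StubConnection connection = new StubConnection();
             StubStatement stmt = new StubStatement()) {
            stmt.executeQuery(); // throws, simulating a failing iteration
        } catch (RuntimeException e) {
            // both resources have already been closed at this point
        }
        // Resources close in reverse declaration order: statement, then connection.
        System.out.println(closed);
    }
}
```

Applied to the loop above, this removes the need to call close() by hand in every branch and catch block.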
The problem here is that the JVM heap size keeps growing. After profiling the application with YourKit Java Profiler, I traced the memory leak to the following line:

Environment env = new Environment(envHome, envConfig);

This is the exact line where the leak starts: the env variable seems to take up too much space and keeps growing after every operation, even though the operations are independent. I really don't know what this variable does or how to fix it. One more thing: I did change the source code of crawlController, which I think may be relevant.
Answer (score: 1):
I assume that you are using crawler4j as your crawling framework. Every time you create a crawl controller, a new frontier is instantiated, which is shared among the crawler threads to manage the queue of URLs to crawl. In addition, a so-called docIdServer is created, which is responsible for determining whether an incoming URL (i.e. website) has already been processed in this crawl.

This frontier and the docIdServer are based on an in-memory database (Berkeley DB Java Edition, which is where the Environment class in the line above comes from), and the environment is responsible for caching, locking, logging and transactions. For that reason, this variable will grow over time.
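To make the role of the docIdServer concrete, here is a simplified in-memory stand-in (a sketch only; crawler4j's real implementation is backed by a database, which is exactly why it occupies heap): it hands out a fresh id for a URL it has never seen and returns the existing id otherwise.

```java
import java.util.HashMap;
import java.util.Map;

public class DocIdServerSketch {
    private final Map<String, Integer> ids = new HashMap<>();
    private int nextId = 1;

    /** Returns the doc id for this URL, assigning a fresh one if it is new. */
    public int getOrCreateDocId(String url) {
        return ids.computeIfAbsent(url, u -> nextId++);
    }

    /** True if this URL has already been seen in this crawl. */
    public boolean isSeen(String url) {
        return ids.containsKey(url);
    }

    public static void main(String[] args) {
        DocIdServerSketch server = new DocIdServerSketch();
        System.out.println(server.getOrCreateDocId("http://example.com/"));      // new URL
        System.out.println(server.getOrCreateDocId("http://example.com/about")); // new URL
        System.out.println(server.getOrCreateDocId("http://example.com/"));      // already seen
        System.out.println(server.isSeen("http://example.com/"));
    }
}
```

Since this state is per-controller, creating a controller per website in a loop means a fresh frontier and docIdServer each time, and the leak is the accumulated backing storage of all of them.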
If you set resumable crawling to true, the database will operate in file mode and will grow slowly there.
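Note that the question's loop also builds a fresh crawlStorageFolder ("tmp/temp_storageadmin" + SessionId) per iteration, so in file mode those folders pile up on disk instead. One way to keep that bounded is to delete the folder once its controller has fully finished; a sketch using only the JDK (it assumes nothing still holds files open inside the folder):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class StorageCleanup {
    /** Recursively deletes a directory tree (children before parents). */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) return;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order sorts deeper paths first, so files go before their directories.
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> {
                    try { Files.delete(p); }
                    catch (IOException e) { throw new UncheckedIOException(e); }
                });
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate one crawl's storage folder, then remove it once the crawl is done.
        Path folder = Files.createTempDirectory("temp_storage");
        Files.writeString(folder.resolve("frontier.db"), "dummy");
        deleteRecursively(folder);
        System.out.println(Files.exists(folder)); // false
    }
}
```

In the original loop, this would run right after controller.start_auto(...) returns for a given website.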