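I am trying to understand Crawler4j, the open-source web crawler. While reading the code I ran into a few doubts, listed below.
Questions:
1. What does StatisticsDB do in the Counters class? Please explain the following piece of code: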
public Counters(Environment env, CrawlConfig config) throws DatabaseException {
    super(config);
    this.env = env;
    this.counterValues = new HashMap<String, Long>();
    /*
     * When crawling is set to be resumable, we have to keep the statistics
     * in a transactional database to make sure they are not lost if crawler
     * is crashed or terminated unexpectedly.
     */
    if (config.isResumableCrawling()) {
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(true);
        dbConfig.setDeferredWrite(false);
        statisticsDB = env.openDatabase(null, "Statistics", dbConfig);
        OperationStatus result;
        DatabaseEntry key = new DatabaseEntry();
        DatabaseEntry value = new DatabaseEntry();
        Transaction tnx = env.beginTransaction(null, null);
        Cursor cursor = statisticsDB.openCursor(tnx, null);
        result = cursor.getFirst(key, value, null);
        while (result == OperationStatus.SUCCESS) {
            if (value.getData().length > 0) {
                String name = new String(key.getData());
                long counterValue = Util.byteArray2Long(value.getData());
                counterValues.put(name, counterValue);
            }
            result = cursor.getNext(key, value, null);
        }
        cursor.close();
        tnx.commit();
    }
}
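As far as I understand, it saves the crawled URLs, so that if the crawler crashes it does not have to start again from the beginning. Please explain the code above line by line.
2. I could not find any good links explaining SleepyCat, which Crawler4j uses to store intermediate information. Please point me to some good resources where I can learn the basics of SleepyCat. (I don't know what Transaction and Cursor mean in the code above.)
Please help me. Looking forward to your reply.
Answer 0 (score: 1)
Basically, Crawler4j restores the existing statistics by reading every key/value record in the "Statistics" database and copying it into the in-memory counterValues map. The transaction handling is questionable, though: a transaction is opened, but nothing in this block modifies the DB, so the lines dealing with tnx could be removed.
Line-by-line comments: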
// Create a database configuration object
DatabaseConfig dbConfig = new DatabaseConfig();
// Set some parameters: allow creation, make the DB transactional, and don't use deferred write
dbConfig.setAllowCreate(true);
dbConfig.setTransactional(true);
dbConfig.setDeferredWrite(false);
// Open the database called "Statistics" with the configuration created above
statisticsDB = env.openDatabase(null, "Statistics", dbConfig);
OperationStatus result;
// Create the entries that will receive each key and value read from the DB
DatabaseEntry key = new DatabaseEntry();
DatabaseEntry value = new DatabaseEntry();
// Start a transaction
Transaction tnx = env.beginTransaction(null, null);
// Get a cursor on the DB; a cursor iterates over the records one by one
Cursor cursor = statisticsDB.openCursor(tnx, null);
// Position the cursor on the first key/value record
result = cursor.getFirst(key, value, null);
// Loop for as long as a record was read successfully
while (result == OperationStatus.SUCCESS) {
    // If the value at the current cursor position is not empty, decode the counter's
    // name and value and add them to the HashMap counterValues
    if (value.getData().length > 0) {
        String name = new String(key.getData());
        long counterValue = Util.byteArray2Long(value.getData());
        counterValues.put(name, counterValue);
    }
    // Move the cursor to the next record
    result = cursor.getNext(key, value, null);
}
// Release the cursor
cursor.close();
// Commit the transaction; nothing was written here, so this only ends the transaction
tnx.commit();
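
To see what the loop does without a Berkeley DB environment, here is a minimal sketch in plain Java. It is not Crawler4j code: the class CounterLoadSketch, its methods, and the counter name "example-counter" are made up for illustration, an ordinary Map stands in for the "Statistics" database, and it assumes each counter is stored as an 8-byte big-endian long (the real Util.byteArray2Long may use a different layout).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the loading loop; no Berkeley DB is involved.
class CounterLoadSketch {

    // Assumption: a counter is stored as an 8-byte big-endian long.
    static byte[] longToBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    static long bytesToLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    // Mirrors the while loop: skip empty values, decode the rest into the in-memory map.
    static Map<String, Long> load(Map<byte[], byte[]> storedRecords) {
        Map<String, Long> counterValues = new HashMap<>();
        for (Map.Entry<byte[], byte[]> record : storedRecords.entrySet()) {
            if (record.getValue().length > 0) {
                String name = new String(record.getKey(), StandardCharsets.UTF_8);
                counterValues.put(name, bytesToLong(record.getValue()));
            }
        }
        return counterValues;
    }

    public static void main(String[] args) {
        // The map plays the role of the records a cursor would walk over.
        Map<byte[], byte[]> stored = new LinkedHashMap<>();
        stored.put("example-counter".getBytes(StandardCharsets.UTF_8), longToBytes(42L));
        stored.put("empty-record".getBytes(StandardCharsets.UTF_8), new byte[0]);
        System.out.println(load(stored));
    }
}
```

The point of the sketch is that the block only reads: each stored record becomes one entry in counterValues, and nothing is written back, which is why the transaction around the cursor is not doing any real work.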