What does StatisticsDB do in the Crawler4j open source code?

Time: 2013-05-17 12:15:46

Tags: web-crawler crawler4j

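I am trying to understand the Crawler4j open source web crawler. Along the way I have run into a few doubts, listed below.

Questions:

  1. What does StatisticsDB do in the Counters class? Please explain the following part of the code: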

     public Counters(Environment env, CrawlConfig config) throws DatabaseException {
        super(config);
    
        this.env = env;
        this.counterValues = new HashMap<String, Long>();
    
        /*
         * When crawling is set to be resumable, we have to keep the statistics
         * in a transactional database to make sure they are not lost if crawler
         * is crashed or terminated unexpectedly.
         */
        if (config.isResumableCrawling()) {
            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            dbConfig.setTransactional(true);
            dbConfig.setDeferredWrite(false);
            statisticsDB = env.openDatabase(null, "Statistics", dbConfig);
    
            OperationStatus result;
            DatabaseEntry key = new DatabaseEntry();
            DatabaseEntry value = new DatabaseEntry();
            Transaction tnx = env.beginTransaction(null, null);
            Cursor cursor = statisticsDB.openCursor(tnx, null);
            result = cursor.getFirst(key, value, null);
    
            while (result == OperationStatus.SUCCESS) {
                if (value.getData().length > 0) {
                    String name = new String(key.getData());
                    long counterValue = Util.byteArray2Long(value.getData());
                    counterValues.put(name, counterValue);
                }
                result = cursor.getNext(key, value, null);
            }
            cursor.close();
            tnx.commit();
        }
    }
    
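  2. As far as I know, it saves the crawled URLs, so that if the crawler crashes it does not have to start again from the beginning. Please explain the code above line by line.

  3. I have not found any good links explaining SleepyCat, which Crawler4j uses to store intermediate information. Please point me to some good resources for learning the basics of SleepyCat. (I do not know what Transaction and Cursor mean in the code above.)

    Please help me. Looking forward to your replies.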

1 Answer:

Answer 0: (score: 1)

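Basically, Crawler4j restores the existing statistics by reading every key/value pair stored in the database. Strictly speaking, the code is not quite right: it opens a transaction even though nothing in the DB is modified. The lines that deal with tnx could therefore be removed, with null passed to openCursor in its place.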
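To make the "load existing statistics at startup" idea concrete, here is a minimal sketch that uses a plain in-memory Map as a stand-in for the Statistics database. Everything in it is hypothetical: the class ResumableCounters, the counter name "pages-seen", and the write-through in increment are assumptions for illustration, not crawler4j's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the pattern in the Counters constructor.
// A Map plays the role of the persistent "Statistics" database.
final class ResumableCounters {
    // Persistent store; null means the crawl is not resumable.
    private final Map<String, Long> store;
    // In-memory view of the counters, like counterValues in the question.
    private final Map<String, Long> counterValues = new HashMap<>();

    ResumableCounters(Map<String, Long> store) {
        this.store = store;
        if (store != null) {
            // Same effect as the cursor loop: copy every saved counter into memory.
            counterValues.putAll(store);
        }
    }

    synchronized long getValue(String name) {
        return counterValues.getOrDefault(name, 0L);
    }

    // Assumption: each update is written through to the store immediately,
    // so a crash loses nothing that was already counted.
    synchronized void increment(String name, long addition) {
        long updated = getValue(name) + addition;
        counterValues.put(name, updated);
        if (store != null) {
            store.put(name, updated);
        }
    }

    public static void main(String[] args) {
        Map<String, Long> disk = new TreeMap<>(); // survives the "restart" below
        ResumableCounters firstRun = new ResumableCounters(disk);
        firstRun.increment("pages-seen", 3);

        // Simulate a restart: a new object reloads what the first run saved.
        ResumableCounters secondRun = new ResumableCounters(disk);
        secondRun.increment("pages-seen", 2);
        System.out.println(secondRun.getValue("pages-seen")); // prints 5
    }
}
```

The second instance starts from the value left behind by the first one instead of from zero, which is the behaviour the resumable-crawling comment in the question's code describes.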

Line-by-line comments:

    // Create a database configuration object
    DatabaseConfig dbConfig = new DatabaseConfig();
    // Set some parameters: allow creation, make the DB transactional, and disable deferred write
    dbConfig.setAllowCreate(true);
    dbConfig.setTransactional(true);
    dbConfig.setDeferredWrite(false);
    // Open the database named "Statistics" with the configuration created above
    statisticsDB = env.openDatabase(null, "Statistics", dbConfig);

    OperationStatus result;
    // Create empty entries that will receive each key and value
    DatabaseEntry key = new DatabaseEntry();
    DatabaseEntry value = new DatabaseEntry();
    // Start a transaction
    Transaction tnx = env.beginTransaction(null, null);
    // Open a cursor on the DB
    Cursor cursor = statisticsDB.openCursor(tnx, null);
    // Position the cursor on the first key/value pair
    result = cursor.getFirst(key, value, null);
    // Loop for as long as a record was read successfully
    while (result == OperationStatus.SUCCESS) {
        // If the value at the current cursor position is not empty, read the counter's
        // name and value and add them to the counterValues HashMap
        if (value.getData().length > 0) {
            String name = new String(key.getData());
            long counterValue = Util.byteArray2Long(value.getData());
            counterValues.put(name, counterValue);
        }
        // Move the cursor to the next key/value pair
        result = cursor.getNext(key, value, null);
    }
    // Release the cursor
    cursor.close();
    // Commit the transaction; nothing was written, so this only ends it
    tnx.commit();
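Inside the loop, each key is turned back into a String and each value back into a long. The sketch below shows one way such a round trip can work using only the JDK. It assumes the value is stored as 8 big-endian bytes and the name as UTF-8 text. Both are assumptions, since the byte layout used by crawler4j's Util.byteArray2Long is not shown in the question, and CounterCodec and "pages-seen" are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of crawler4j: converts a counter name and
// value to the raw byte arrays a key/value store works with, and back.
final class CounterCodec {

    // Encode a long as 8 bytes, most significant byte first (ByteBuffer's default order).
    static byte[] longToBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Decode 8 big-endian bytes back into a long.
    static long bytesToLong(byte[] data) {
        if (data.length != Long.BYTES) {
            throw new IllegalArgumentException("expected 8 bytes, got " + data.length);
        }
        return ByteBuffer.wrap(data).getLong();
    }

    // Use an explicit charset so the result does not depend on the platform default.
    static byte[] nameToBytes(String name) {
        return name.getBytes(StandardCharsets.UTF_8);
    }

    static String bytesToName(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] key = nameToBytes("pages-seen");
        byte[] value = longToBytes(1234L);
        System.out.println(bytesToName(key) + " = " + bytesToLong(value));
    }
}
```

Note that `new String(key.getData())` in the original code relies on the platform default charset. Passing a charset explicitly, as above, is a design choice of this sketch rather than something the original code does.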

I also answered a similar question here. As for SleepyCat, is this what you are talking about?