Nutch has several status codes that it uses to classify crawled documents.
Examples of the codes Nutch uses:
db_unfetched
db_fetched
db_gone
db_redir_perm
db_redir_temp
db_notmodified
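Where can I find a clear explanation of what these codes mean?
Reading forum posts and answers on Stack Overflow gives a reasonable understanding of the codes. This page also provides some useful input: http://wiki.apache.org/nutch/CrawlDatumStates, but I am looking for a page that describes what each status code means.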
Answer 0 (score: 5)
There is no official documentation, but I was able to extract the following from the CrawlDatum class:
/** Page was not fetched yet. */
public static final byte STATUS_DB_UNFETCHED = 0x01;
/** Page was successfully fetched. */
public static final byte STATUS_DB_FETCHED = 0x02;
/** Page no longer exists. */
public static final byte STATUS_DB_GONE = 0x03;
/** Page temporarily redirects to other page. */
public static final byte STATUS_DB_REDIR_TEMP = 0x04;
/** Page permanently redirects to other page. */
public static final byte STATUS_DB_REDIR_PERM = 0x05;
/** Page was successfully fetched and found not modified. */
public static final byte STATUS_DB_NOTMODIFIED = 0x06;
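
If it helps to have these meanings at hand while reading crawl statistics, the constants above can be turned into a small lookup table. This is only a minimal sketch: the class `CrawlStatusLookup` and its `describe` method are hypothetical names made up for illustration and are not part of Nutch. The byte values and descriptions are copied from the constants quoted above, and the sketch assumes that each `db_*` name corresponds to the matching `STATUS_DB_*` constant.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: maps the CrawlDb status bytes
// quoted above to their db_* names and Javadoc descriptions.
public class CrawlStatusLookup {

    private static final Map<Byte, String> DESCRIPTIONS = new LinkedHashMap<>();

    static {
        // Values and wording copied from the CrawlDatum constants above.
        DESCRIPTIONS.put((byte) 0x01, "db_unfetched: page was not fetched yet");
        DESCRIPTIONS.put((byte) 0x02, "db_fetched: page was successfully fetched");
        DESCRIPTIONS.put((byte) 0x03, "db_gone: page no longer exists");
        DESCRIPTIONS.put((byte) 0x04, "db_redir_temp: page temporarily redirects to other page");
        DESCRIPTIONS.put((byte) 0x05, "db_redir_perm: page permanently redirects to other page");
        DESCRIPTIONS.put((byte) 0x06, "db_notmodified: page was successfully fetched and found not modified");
    }

    // Returns the description for a status byte, or a placeholder for
    // any value that is not in the list quoted above.
    public static String describe(byte status) {
        String fallback = String.format("unknown status 0x%02x", status);
        return DESCRIPTIONS.getOrDefault(status, fallback);
    }

    public static void main(String[] args) {
        // Print every known status in declaration order.
        for (Map.Entry<Byte, String> entry : DESCRIPTIONS.entrySet()) {
            System.out.printf("0x%02x  %s%n", entry.getKey(), entry.getValue());
        }
    }
}
```

Calling `CrawlStatusLookup.describe((byte) 0x03)` returns the `db_gone` line; any byte outside the six listed values falls through to the "unknown status" placeholder.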