Nutch re-crawl and extracting new links

Time: 2014-05-28 15:13:09

Tags: solr lucene nutch

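I am writing a news crawling application. It should monitor more than 300 sites every day and send new URLs (links to new news items) to Solr, which is our indexer module. I have installed Nutch and applied all the required configuration. Everything works except the re-crawl step. I went through many articles (for example http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ and http://wiki.apache.org/nutch/IntranetRecrawl) and applied their configurations, but unfortunately they did not work for me. Is there a script or configuration that solves this?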

I am using Nutch 1.8.

Regards

1 answer:

Answer 0: (score: 0)

Try making some changes to your nutch-site.xml:

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <value>10</value>
  <description>The default number of seconds between re-fetches of a page
  (10 seconds here; the stock Nutch default is 2592000, i.e. 30 days).
  </description>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <!-- for now always re-fetch everything -->
  <value>10</value>
  <description>The maximum number of seconds between re-fetches of a page
  (10 seconds here, far less than one day). After this period every page in the db
  will be re-tried, no matter what its status is.
  </description>
</property>
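
A 10-second interval will most likely make every page due on every cycle, so each run re-fetches all 300+ sites in full. For daily monitoring, a less aggressive variant might look like the sketch below. The property names are taken from nutch-default.xml; the specific values (one day as the starting interval, one hour and one week as the adaptive bounds) are assumptions for illustration, not tested recommendations.

```xml
<?xml version="1.0"?>
<!-- Sketch of a nutch-site.xml for daily re-crawls.
     All numeric values below are illustrative assumptions. -->
<configuration>
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>

  <property>
    <name>db.fetch.interval.default</name>
    <!-- 24 * 60 * 60 = 86400 seconds = one day -->
    <value>86400</value>
  </property>

  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <!-- lower bound for the adaptive interval: 3600 seconds = one hour -->
    <value>3600.0</value>
  </property>

  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <!-- upper bound for the adaptive interval: 7 * 86400 = 604800 seconds = one week -->
    <value>604800.0</value>
  </property>

  <property>
    <name>db.fetch.interval.max</name>
    <!-- force a re-try of any page after at most one week -->
    <value>604800</value>
  </property>
</configuration>
```

With the adaptive schedule, pages that change between fetches should drift toward the lower bound and static pages toward the upper bound, so a daily run spends most of its time on front pages and section listings, which is where new news links tend to appear.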