Question

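I am writing a news crawler application. It should monitor more than 300 sites every day and send new URLs (new news links) to Solr, which is our indexer module. I have installed Nutch and applied all the required configuration. Everything works except the re-crawl step. I have followed many articles (for example http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ and http://wiki.apache.org/nutch/IntranetRecrawl) and applied their configurations, but unfortunately they do not work for me. Is there a script or configuration that solves this problem?

I am using Nutch 1.8.

Regards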

Answer 1

Try making the following changes to your nutch-site.xml:

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <value>10</value>
  <description>The default number of seconds between re-fetches of a page.
  Set to 10 seconds here; the stock Nutch default is 2592000 (30 days).
  </description>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <!-- for now always re-fetch everything -->
  <value>10</value>
  <description>The maximum number of seconds between re-fetches of a page
  (10 seconds here). After this period every page in the db will be re-tried,
  no matter what its status is.
  </description>
</property>
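
Because the schedule class above is AdaptiveFetchSchedule, the re-fetch interval Nutch actually uses is also shaped by the db.fetch.schedule.adaptive.* properties. The fragment below is only a sketch of how those bounds could be set for a daily news crawl. The property names are the ones listed in nutch-default.xml, but every value here is an illustrative assumption and not part of the original answer, so tune them to your own sites.

```xml
<!-- Sketch only: place inside <configuration> in nutch-site.xml.
     All values below are assumptions chosen for illustration. -->

<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <!-- assumed: never re-fetch a page more often than once per hour -->
  <value>3600</value>
  <description>Minimum re-fetch interval, in seconds.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <!-- assumed: re-fetch every page at least once per day -->
  <value>86400</value>
  <description>Maximum re-fetch interval, in seconds.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <!-- assumed: lengthen the interval by 40% when a page is unmodified -->
  <value>0.4</value>
  <description>Rate at which the interval grows for unmodified pages.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <!-- assumed: shorten the interval by 20% when a page has changed -->
  <value>0.2</value>
  <description>Rate at which the interval shrinks for modified pages.</description>
</property>
```

These settings only decide when a page becomes due again. A page is actually re-fetched only when the generate, fetch, and updatedb cycle is run again after that time, so the crawl still has to be scheduled to run repeatedly (for example from cron).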
