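I am writing a news crawling application. It should monitor more than 300 sites every day and send new URLs (new news links) to Solr, which is our indexer module. I have installed Nutch and applied all the required configuration. Everything works except re-crawling. I have read many articles (for example http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ and http://wiki.apache.org/nutch/IntranetRecrawl) and applied their configurations, but unfortunately they do not work for me. Is there a script or a configuration that solves my problem?
I am using Nutch 1.8.
Regards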
Answer 0 (score: 0)
Try making the following changes to your nutch-site.xml:
<property>
<name>db.fetch.schedule.class</name>
<value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
<name>db.fetch.interval.default</name>
<value>10</value>
<description>The default number of seconds between re-fetches of a page.
Set to 10 seconds here; the stock Nutch default is 2592000 (30 days).
</description>
</property>
<property>
<name>db.fetch.interval.max</name>
<!-- for now always re-fetch everything -->
<value>10</value>
<description>The maximum number of seconds between re-fetches of a page.
Set to 10 seconds here; the stock Nutch default is 7776000 (90 days).
After this period every page in the db will be re-tried, no matter what
its status is.
</description>
</property>
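The 10-second intervals above force everything to be re-fetched on each cycle, which is useful for confirming that re-crawling works at all. Once it does, AdaptiveFetchSchedule can be bounded with its own properties so that frequently changing news pages are revisited often and static pages less often. The fragment below is a minimal sketch for nutch-site.xml: the property names come from nutch-default.xml, but the values (one hour and one day) are illustrative assumptions, not recommendations from the original answer. As far as I know the adaptive schedule clamps each page's interval to this range, so a default interval below the minimum would be raised to it after the first update.

```xml
<!-- Assumed values for illustration only; tune them to how often the monitored sites publish. -->
<property>
<name>db.fetch.schedule.adaptive.min_interval</name>
<!-- lower bound for the adaptive interval: 3600 seconds = 1 hour (assumption) -->
<value>3600</value>
<description>Minimum fetch interval, in seconds.
</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.max_interval</name>
<!-- upper bound for the adaptive interval: 86400 seconds = 1 day (assumption) -->
<value>86400</value>
<description>Maximum fetch interval, in seconds.
</description>
</property>
```

If these bounds are used, db.fetch.interval.default and db.fetch.interval.max would normally be raised from 10 seconds to values inside or above this range, so that the settings do not contradict each other.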