Question

I'm still getting used to Nutch. I managed to run a test crawl of nutch.apache.org with bin/nutch crawl urls -dir crawl -depth 6 -topN 10, and to index it into Solr with: bin/nutch crawl urls -solr http://<domain>:<port>/solr/core1/ -depth 4 -topN 7

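Leaving aside that it times out on my own site, I can't seem to get it to crawl again, or to crawl any other site (for example wiki.apache.org). I deleted all the crawl directories in the Nutch home directory, but I still get the following error, which says there are no more URLs to crawl: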

<user>@<domain>:/usr/share/nutch$ sudo sh nutch-test.sh
solrUrl is not set, indexing will be skipped...
crawl started in: crawl 
rootUrlDir = urls
threads = 10
depth = 6
solrUrl=null
topN = 10
Injector: starting at 2013-07-03 15:56:47
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-07-03 15:56:50, elapsed: 00:00:03
Generator: starting at 2013-07-03 15:56:50
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl

My urls/seed.txt file contains http://nutch.apache.org/.

My regex-urlfilter.txt contains +^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org//([a-z0-9\-A-Z]*\/)*.

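I also increased -depth and -topN so that more would be indexed, but every crawl after the first one gives this error. How do I reset it so that it crawls again? Is there some URL cache somewhere in Nutch that needs to be cleared?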

Update: the problem with our own site seems to be that I wasn't using www, and the host doesn't resolve without it. When pinged, www.ourdomain.org does resolve.

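I've put the www form into the necessary files, but the problem remains. The common factor seems to be the line "Injector: total number of urls rejected by filters: 1", which shows up on every crawl except the first one. Why is the URL being rejected when it shouldn't be, and which filter is rejecting it?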

Answer 1

This is quite embarrassing. But the old nutch-not-crawling-because-it-is-rejecting-urls adage of 'check your *-urlfilter.txt file' applies here.

In my case I had an extra / in the URL regex:

+^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org//([a-z0-9\-A-Z]*\/)*

It should have been +^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org/([a-z0-9\-A-Z]*\/)*
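
For anyone who wants to check a rule before re-running the crawl, here is a rough sketch in Python of how the two patterns behave against the seed URL. It is not Nutch code: Nutch's regex filter runs on Java regexes, and the accepts helper and the (sign, pattern) rule layout are only my own illustration. The assumption it encodes is that rules are tried in order, the first pattern found in the URL decides accept (+) or reject (-), and a URL that matches no rule is rejected.

```python
import re

# The two patterns from this thread, without the leading "+" sign.
BROKEN = r"^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org//([a-z0-9\-A-Z]*\/)*"
FIXED = r"^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org/([a-z0-9\-A-Z]*\/)*"


def accepts(rules, url):
    """Hypothetical helper: emulate a first-match-wins URL filter.

    rules is a list of (sign, pattern) pairs, where sign is "+" or "-".
    A URL that matches no rule at all is treated as rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


seed = "http://nutch.apache.org/"

# The doubled slash after the host can never match a URL with a single slash.
print("broken rule accepts seed:", accepts([("+", BROKEN)], seed))
# With a single slash the seed URL matches and is accepted.
print("fixed rule accepts seed:", accepts([("+", FIXED)], seed))
```

With the doubled slash the seed URL matches nothing, which would explain the "rejected by filters: 1" and "injected after normalization and filtering: 0" lines in the log above.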

Getting a "No URLs to fetch" error on Nutch even though there are URLs to fetch
