Nutch Crawl不起作用

时间:2017-01-03 04:48:17

标签: solr nutch

我想使用Apache Nutch 1.12抓取网站并将数据索引到Apache Solr。我遵循了这个tutorial

我的seed.txt文件有此网址http://nutch.apache.org/

在我的正则表达式网址过滤器中我喜欢这个+ ^ http://([a-z0-9] *。)* nutch.apache.org /

当我尝试获取数据时,我只获取了seed.txt文件中的url。

Fetcher: starting at 2017-01-03 09:56:23
Fetcher: segment: crawl/segments/20170103095613
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 2 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://nutch.apache.org/ (queue crawl delay=5000ms)
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
robots.txt whitelist not configured.
robots.txt whitelist not configured.
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=2
Thread FetcherThread has no more work available
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0

我在这里缺少什么。

1 个答案:

答案 0 :(得分:0)

我尝试再次执行获取操作以获得预期的结果。