Nutch crawl stops after parsing one page

Date: 2013-09-12 10:30:49

Tags: web-crawler nutch

When crawling with Nutch, it parses only a single page and does not move forward. Can anyone please help? The Nutch output is below.

After parsing the first page, it stops and does not proceed any further. The crawl does not complete successfully.

[Naveen@01hw5189 apache-nutch-1.7]$ bin/nutch crawl urls -dir crawlwiki -depth 10 -topN 10
solrUrl is not set, indexing will be skipped...
crawl started in: crawlwiki
rootUrlDir = urls
threads = 10
depth = 10
solrUrl=null
topN = 10
Injector: starting at 2013-09-12 15:51:45
Injector: crawlDb: crawlwiki/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-12 15:51:47, elapsed: 00:00:02
Generator: starting at 2013-09-12 15:51:47
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawlwiki/segments/20130912155149
Generator: finished at 2013-09-12 15:51:50, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-12 15:51:50
Fetcher: segment: crawlwiki/segments/20130912155149
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
fetching http://en.wikipedia.org/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-09-12 15:51:53, elapsed: 00:00:03
ParseSegment: starting at 2013-09-12 15:51:53
ParseSegment: segment: crawlwiki/segments/20130912155149
ParseSegment: finished at 2013-09-12 15:51:54, elapsed: 00:00:01
CrawlDb update: starting at 2013-09-12 15:51:54
CrawlDb update: db: crawlwiki/crawldb
CrawlDb update: segments: [crawlwiki/segments/20130912155149]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-09-12 15:51:56, elapsed: 00:00:02
Generator: starting at 2013-09-12 15:51:56
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawlwiki/segments/20130912155159
Generator: finished at 2013-09-12 15:52:00, elapsed: 00:00:04
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-12 15:52:00
Fetcher: segment: crawlwiki/segments/20130912155159
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
fetching http://en.wikipedia.org/wiki/Main_Page (queue crawl delay=5000ms)
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-09-12 15:52:02, elapsed: 00:00:02
ParseSegment: starting at 2013-09-12 15:52:02
ParseSegment: segment: crawlwiki/segments/20130912155159
Parsed (8ms):http://en.wikipedia.org/wiki/Main_Page

1 Answer:

Answer 0 (score: 1)

Check Wikipedia's robots.txt file at:

http://en.wikipedia.org/robots.txt

The robots.txt file may be denying further deep crawling. A robots file defines what a web crawler is allowed to access, and Nutch honors this "netiquette".
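You can verify which paths the robots rules permit before blaming Nutch. A minimal sketch using Python's standard-library robot parser, run against a small excerpt modelled on the kind of rules in Wikipedia's robots.txt (fetch http://en.wikipedia.org/robots.txt for the real file; the agent name "mynutchbot" is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Excerpt modelled on Wikipedia-style rules: raw script paths under /w/
# are disallowed for all agents, while article pages under /wiki/ are not.
rules = """User-agent: *
Disallow: /w/
Disallow: /trap/""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# /wiki/ pages match no Disallow rule, so they are fetchable.
print(rp.can_fetch("mynutchbot", "http://en.wikipedia.org/wiki/Main_Page"))
# /w/index.php falls under "Disallow: /w/", so it is blocked.
print(rp.can_fetch("mynutchbot", "http://en.wikipedia.org/w/index.php"))
```

If the URLs Nutch discovers all fall under disallowed paths, the fetcher will correctly refuse them and the crawl will appear to stall.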

Hope that helps.
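Separately, the log warning "Your 'http.agent.name' value should be listed first in 'http.robots.agents' property" suggests the agent configuration is worth checking too, since robots.txt rules are matched against that agent name. A minimal sketch of the relevant properties in conf/nutch-site.xml ("mynutchbot" is again a placeholder, not a value from the question):

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: http.agent.name is required for fetching,
     and the same name should appear first in http.robots.agents. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>mynutchbot</value> <!-- placeholder: use your own crawler name -->
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>mynutchbot,*</value>
  </property>
</configuration>
```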