Nutch crawl finishes without errors but produces no results

Time: 2013-04-14 03:27:27

Tags: nutch web-crawler

I am trying to crawl some URLs with Nutch 2.1, as shown below.

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

http://wiki.apache.org/nutch/NutchTutorial

There are no errors, but the folders listed below are not created.

crawl/crawldb
crawl/linkdb
crawl/segments

Can anyone help me? I have been stuck on this problem for two days. Thanks a lot!

The output is as follows.

FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread1, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread2, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread4, activeThreads=5
-finishing thread FetcherThread5, activeThreads=4
-finishing thread FetcherThread6, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:  false
ParserJob: parsing all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread1, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread2, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread4, activeThreads=5
-finishing thread FetcherThread5, activeThreads=4
-finishing thread FetcherThread6, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:  false
ParserJob: parsing all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread9, activeThreads=9
-finishing thread FetcherThread0, activeThreads=8
-finishing thread FetcherThread1, activeThreads=7
-finishing thread FetcherThread2, activeThreads=6
-finishing thread FetcherThread3, activeThreads=5
-finishing thread FetcherThread4, activeThreads=4
-finishing thread FetcherThread5, activeThreads=3
-finishing thread FetcherThread6, activeThreads=2
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:  false
ParserJob: parsing all

runtime/local/conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->


<configuration>
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>
<property>
  <name>http.robots.agents</name>
  <value>My Nutch Spider</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>262144</value>
</property>
</configuration>

runtime/local/conf/regex-urlfilter.txt

# accept anything else
+.

runtime/local/urls/seed.txt

http://nutch.apache.org/

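1 answer:

Answer 0: (score: 3)

Since you are using Nutch 2.X, you need to follow the corresponding tutorial. The one you linked is for Nutch 1.x. Nutch 2.X uses an external storage backend such as HBase or Cassandra, so directories like crawldb and segments are not created.

Also, use the bin/crawl script instead of the bin/nutch command.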
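
For reference, here is a rough sketch of what a single step-by-step Nutch 2.x round might look like. The `run` and `crawl_round` helpers and the `DRY_RUN` variable are made up for illustration; the seed directory `urls` and `-topN 5` are taken from the question. Sub-command arguments can differ between 2.x releases, so treat the exact invocations as assumptions and check the usage output of `bin/nutch` for your version.

```shell
#!/bin/sh
# Hypothetical helper: print (default) or execute one Nutch 2.x crawl round.
# Assumes the current directory is runtime/local and the seed dir is "urls".

# run: with DRY_RUN=1 (the default) only print the command, otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# crawl_round: inject the seeds, then generate/fetch/parse/update once.
# The && chain stops at the first command that fails.
crawl_round() {
  run bin/nutch inject urls &&
  run bin/nutch generate -topN 5 &&
  run bin/nutch fetch -all &&
  run bin/nutch parse -all &&
  # Assumption: some newer 2.x releases may expect "updatedb -all" here.
  run bin/nutch updatedb &&
  # Print statistics from the storage backend instead of looking for crawl/crawldb.
  run bin/nutch readdb -stats
}

crawl_round
```

With the default `DRY_RUN=1` the script only prints the six commands; run it with `DRY_RUN=0` from runtime/local to actually execute them. The point of the last step is that in 2.x the crawl state lives in the storage backend, so `readdb -stats` is where to look for results rather than the crawl/ directory.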