Nutch is not fetching any URLs

Time: 2014-05-27 05:37:41

Tags: hadoop solr nutch

I am running Nutch 2 on a Hadoop cluster (2 nodes). I run the crawl command as:

bin/crawl urls/seed.txt TestCrawl http://10.130.231.16:8983/solr/nutch 2
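For reference, the arguments to the Nutch 2.x `crawl` script map roughly as follows (a sketch based on the standard 2.x script; exact usage can vary between Nutch versions):

```shell
# Usage (Nutch 2.x): bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
#   seedDir        - file/dir with seed URLs          (here: urls/seed.txt)
#   crawlID        - ID prefixing the storage tables  (here: TestCrawl)
#   solrURL        - Solr core used for indexing      (here: http://10.130.231.16:8983/solr/nutch)
#   numberOfRounds - generate/fetch/parse/updatedb rounds (here: 2)
bin/crawl urls/seed.txt TestCrawl http://10.130.231.16:8983/solr/nutch 2
```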

The screen output shows that 765 URLs were injected after filtering, but the statistics show that nothing was fetched:

14/05/27 01:33:44 INFO crawl.WebTableReader: Statistics for WebTable: 
14/05/27 01:33:44 INFO crawl.WebTableReader: jobs:  {db_stats-job_201405261214_0047=     {jobID=job_201405261214_0047, jobName=db_stats, counters={File Input Format Counters =  {BYTES_READ=0}, Job Counters ={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=10102,   FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0, TOTAL_LAUNCHED_MAPS=1,   SLOTS_MILLIS_REDUCES=10187}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6,   MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0,   COMMITTED_HEAP_BYTES=231735296, CPU_MILLISECONDS=2570, SPLIT_RAW_BYTES=1017,   COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=313917440, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=2243407872, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1017, FILE_BYTES_WRITTEN=156962, HDFS_BYTES_WRITTEN=86}, File Output Format Counters ={BYTES_WRITTEN=86}}}}
14/05/27 01:33:44 INFO crawl.WebTableReader: TOTAL urls:    0

Why is this happening? My regex filter and domain filter are set to allow all domains (I am attempting a whole-web crawl).
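For a whole-web crawl the URL filters must end with an accept-everything rule. A minimal sketch of what `conf/regex-urlfilter.txt` would contain (assuming the default regex filter plugin; the usual skipped-extension rules are abbreviated):

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# accept anything else (whole-web crawl)
+.
```

Note that when running in deploy mode on a cluster, the filter files are read from the job artifact, so edits under `conf/` only take effect after rebuilding and redeploying the Nutch job jar on every node.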

0 Answers:

There are no answers yet.