I am running Nutch 2 on a Hadoop cluster (2 nodes). I start the crawl with:

bin/crawl urls/seed.txt TestCrawl http://10.130.231.16:8983/solr/nutch 2

The console reports that 765 URLs were injected after filtering, but the statistics show that nothing was fetched:
14/05/27 01:33:44 INFO crawl.WebTableReader: Statistics for WebTable:
14/05/27 01:33:44 INFO crawl.WebTableReader: jobs: {db_stats-job_201405261214_0047= {jobID=job_201405261214_0047, jobName=db_stats, counters={File Input Format Counters = {BYTES_READ=0}, Job Counters ={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=10102, FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0, TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=10187}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=231735296, CPU_MILLISECONDS=2570, SPLIT_RAW_BYTES=1017, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=313917440, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=2243407872, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1017, FILE_BYTES_WRITTEN=156962, HDFS_BYTES_WRITTEN=86}, File Output Format Counters ={BYTES_WRITTEN=86}}}}
14/05/27 01:33:44 INFO crawl.WebTableReader: TOTAL urls: 0
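The WebTableReader output above comes from Nutch's readdb tool; assuming the crawl was started with the id `TestCrawl` as in the command above, the stats can be reproduced with:

```shell
# Print WebTable statistics for the crawl started with id "TestCrawl"
# (without -crawlId, readdb reads the default "webpage" table instead)
bin/nutch readdb -stats -crawlId TestCrawl
```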
Why is this happening? My regex filter and domain filter are configured to allow all domains (I am attempting a whole-web crawl).
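For reference, an "allow everything" setup usually means a `conf/regex-urlfilter.txt` that, like the stock Nutch default, skips only non-HTTP schemes and accepts the rest (the exact contents of my file may differ; this is the typical shape):

```
# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):
# accept anything else
+.
```

If the final `+.` rule is missing or a `-` rule above it matches too broadly, every URL is rejected during the generate/fetch phase even though injection succeeds.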