I'm trying to run the script provided in Nutch 1.6, `bin/crawl`, which performs all of the manual steps below that are needed to crawl a site.
When I run these steps by hand, everything works and my page is indexed as expected (there's only one page for now, but I'll look into that separately):
Create a text file containing the URL at seeds/urls.txt
bin/nutch inject crawl_test/crawldb seeds/
bin/nutch generate crawl_test/crawldb crawl_test/segments
export SEGMENT=crawl_test/segments/`ls -tr crawl_test/segments|tail -1`
bin/nutch fetch $SEGMENT -noParsing
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl_test/crawldb $SEGMENT -filter -normalize
bin/nutch invertlinks crawl_test/linkdb -dir crawl_test/segments
bin/nutch solrindex http://dev:8080/solr/ crawl_test/crawldb -linkdb crawl_test/linkdb crawl_test/segments/*
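The only non-obvious step above is how `SEGMENT` is derived: `ls -tr` lists the segment directories oldest-first by modification time, so `tail -1` picks the newest one. A minimal, self-contained sketch of just that trick (the timestamped directory names below are made up for illustration, not from a real crawl):

```shell
#!/bin/sh
# Simulate a Nutch segments directory with two timestamped segments.
mkdir -p demo_segments/20130412115759 demo_segments/20130412120101

# Give them distinct modification times (older first, newer second).
touch -t 201304121158 demo_segments/20130412115759
touch -t 201304121201 demo_segments/20130412120101

# `ls -tr` sorts oldest-first by mtime, so `tail -1` is the newest.
SEGMENT=demo_segments/`ls -tr demo_segments | tail -1`
echo "$SEGMENT"   # the most recently created segment
```

This mirrors the `export SEGMENT=...` line above, just against a throwaway directory.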
The bin/crawl script fails with this error...
Indexing 20130412115759 on SOLR index -> someurl:8080/solr/
SolrIndexer: starting at 2013-04-12 11:58:47
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: false
SolrIndexer: URL normalizing: false
org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/opt/nutch/20130412115759/crawl_fetch
Input path does not exist: file:/opt/nutch/20130412115759/crawl_parse
Input path does not exist: file:/opt/nutch/20130412115759/parse_data
Input path does not exist: file:/opt/nutch/20130412115759/parse_text
Any idea why this script doesn't work? I suspect it must be a bug in the script itself rather than in my configuration, since the path it's looking for doesn't exist; I don't understand why it would even look there.
Answer 0 (score: 1)
It looks like there is a bug in the bin/crawl script:
- $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $SEGMENT
+ $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
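The diff explains the error message: inside the script, `SEGMENT` holds only the segment directory *name* (e.g. `20130412115759`), not a full path, so passing `$SEGMENT` alone makes Hadoop resolve it relative to the working directory, hence `file:/opt/nutch/20130412115759/...`. A small sketch of the difference, using a hypothetical layout that mirrors the script's variables:

```shell
#!/bin/sh
# Hypothetical layout mirroring the crawl script's variables.
CRAWL_PATH=crawl_test
mkdir -p $CRAWL_PATH/segments/20130412115759

# As in bin/crawl, SEGMENT ends up holding only the directory name.
SEGMENT=`ls $CRAWL_PATH/segments | tail -1`

# Buggy form: bare $SEGMENT resolves relative to the current
# directory, which does not contain such a path.
[ -d "$SEGMENT" ] || echo "bare '$SEGMENT' does not resolve here"

# Fixed form: prefix with $CRAWL_PATH/segments/, as in the patch.
[ -d "$CRAWL_PATH/segments/$SEGMENT" ] && echo "prefixed path resolves"
```

The same prefixing is exactly what the `+` line in the diff applies to the `solrindex` call.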