I am trying to index my Nutch crawl data with the following command:
bin/nutch index -D solr.server.url="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/sc97b4177a_600f_4040_9309_e632c116443f/solr/localWebCollection/" -D solr.auth=true -D solr.auth.username="USER" -D solr.auth.password="PASS" final/crawl/crawldb -linkdb final/crawl
I get no errors, but when I run it, it finishes after a few seconds and nothing gets indexed. Here is my log:
2016-07-22 20:03:09,599 INFO indexer.IndexingJob - Indexer: starting at 2016-07-22 20:03:09
2016-07-22 20:03:09,707 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
2016-07-22 20:03:09,708 INFO indexer.IndexingJob - Indexer: URL filtering: false
2016-07-22 20:03:09,708 INFO indexer.IndexingJob - Indexer: URL normalizing: false
2016-07-22 20:03:10,216 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-07-22 20:03:10,216 INFO indexer.IndexingJob - Active IndexWriters :
SolrIndexWriter
solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
solr.server.url : URL of the Solr instance (mandatory)
solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.commit.size : buffer size when sending to Solr (default 1000)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
2016-07-22 20:03:10,220 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: final/crawl/crawldb
2016-07-22 20:03:10,220 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: final/crawl
2016-07-22 20:03:10,376 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-07-22 20:03:10,495 WARN indexer.IndexerMapReduce - Ignoring linkDb for indexing, no linkDb found in path: final/crawl
2016-07-22 20:03:11,381 WARN conf.Configuration - file:/tmp/hadoop-sdavari/mapred/staging/sdavari1351924025/.staging/job_local1351924025_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-07-22 20:03:11,385 WARN conf.Configuration - file:/tmp/hadoop-sdavari/mapred/staging/sdavari1351924025/.staging/job_local1351924025_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-07-22 20:03:11,551 WARN conf.Configuration - file:/tmp/hadoop-sdavari/mapred/local/localRunner/sdavari/job_local1351924025_0001/job_local1351924025_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-07-22 20:03:11,557 WARN conf.Configuration - file:/tmp/hadoop-sdavari/mapred/local/localRunner/sdavari/job_local1351924025_0001/job_local1351924025_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-07-22 20:03:11,880 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-07-22 20:03:13,437 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-07-22 20:03:13,448 INFO solr.SolrUtils - Authenticating as: f4a73627-777b-4d13-af60-df67be41ecb5
2016-07-22 20:03:13,673 INFO solr.SolrMappingReader - source: content dest: content
2016-07-22 20:03:13,673 INFO solr.SolrMappingReader - source: title dest: title
2016-07-22 20:03:13,673 INFO solr.SolrMappingReader - source: host dest: host
2016-07-22 20:03:13,673 INFO solr.SolrMappingReader - source: url dest: url
2016-07-22 20:03:13,673 INFO solr.SolrMappingReader - source: segment dest: segment
2016-07-22 20:03:13,673 INFO solr.SolrMappingReader - source: boost dest: boost
2016-07-22 20:03:13,673 INFO solr.SolrMappingReader - source: digest dest: digest
2016-07-22 20:03:13,673 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2016-07-22 20:03:14,605 INFO solr.SolrUtils - Authenticating as: f4a73627-777b-4d13-af60-df67be41ecb5
2016-07-22 20:03:14,613 INFO solr.SolrMappingReader - source: content dest: content
2016-07-22 20:03:14,614 INFO solr.SolrMappingReader - source: title dest: title
2016-07-22 20:03:14,614 INFO solr.SolrMappingReader - source: host dest: host
2016-07-22 20:03:14,614 INFO solr.SolrMappingReader - source: url dest: url
2016-07-22 20:03:14,614 INFO solr.SolrMappingReader - source: segment dest: segment
2016-07-22 20:03:14,614 INFO solr.SolrMappingReader - source: boost dest: boost
2016-07-22 20:03:14,614 INFO solr.SolrMappingReader - source: digest dest: digest
2016-07-22 20:03:14,614 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2016-07-22 20:03:15,685 INFO indexer.IndexingJob - Indexer: number of documents indexed, deleted, or skipped:
2016-07-22 20:03:15,695 INFO indexer.IndexingJob - Indexer: finished at 2016-07-22 20:03:15, elapsed: 00:00:06
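Any ideas on how I can fix this and get my data indexed? The URL is for the Bluemix Retrieve and Rank service, but that service is built on top of Apache Solr, so I assume I can use it as long as my Nutch and Solr schemas match. Is that correct?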