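I am using Nutch 2.3.1, Solr 6.5.1, and MongoDB to crawl and index data. I have successfully crawled up to 5 URLs from the seed.text file, but when I try to crawl 499 URLs, I get the following error during indexing.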
> $ runtime/local/bin/nutch solrindex http://localhost:8983/solr/nutch -all
IndexingJob: starting
SolrIndexerJob: java.lang.RuntimeException: job failed: name=apache-nutch-2.3.1.jar, jobid=job_local505251134_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
My Nutch log file is as follows:
> 2019-03-22 16:45:07,991 INFO indexer.IndexingJob - IndexingJob: starting
2019-03-22 16:45:08,203 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2019-03-22 16:45:08,203 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2019-03-22 16:45:08,204 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2019-03-22 16:45:08,204 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2019-03-22 16:45:08,208 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2019-03-22 16:45:08,358 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2019-03-22 16:45:09,206 WARN conf.Configuration - file:/tmp/hadoop-USER/mapred/staging/USER505251134/.staging/job_local505251134_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2019-03-22 16:45:09,208 WARN conf.Configuration - file:/tmp/hadoop-USER/mapred/staging/USER505251134/.staging/job_local505251134_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2019-03-22 16:45:09,264 WARN conf.Configuration - file:/tmp/hadoop-USER/mapred/local/localRunner/USER/job_local505251134_0001/job_local505251134_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2019-03-22 16:45:09,265 WARN conf.Configuration - file:/tmp/hadoop-USER/mapred/local/localRunner/USER/job_local505251134_0001/job_local505251134_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2019-03-22 16:45:09,390 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2019-03-22 16:45:09,408 INFO solr.SolrMappingReader - source: content dest: content
2019-03-22 16:45:09,408 INFO solr.SolrMappingReader - source: title dest: title
2019-03-22 16:45:09,408 INFO solr.SolrMappingReader - source: host dest: host
2019-03-22 16:45:09,408 INFO solr.SolrMappingReader - source: batchId dest: batchId
2019-03-22 16:45:09,408 INFO solr.SolrMappingReader - source: boost dest: boost
2019-03-22 16:45:09,408 INFO solr.SolrMappingReader - source: digest dest: digest
2019-03-22 16:45:09,408 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2019-03-22 16:45:09,410 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2019-03-22 16:45:09,411 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2019-03-22 16:45:09,411 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2019-03-22 16:45:09,411 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2019-03-22 16:45:09,411 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2019-03-22 16:45:09,411 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2019-03-22 16:45:09,625 INFO solr.SolrIndexWriter - Adding 250 documents
2019-03-22 16:45:09,934 INFO solr.SolrIndexWriter - Adding 250 documents
2019-03-22 16:45:10,317 INFO solr.SolrIndexWriter - Adding 129 documents
2019-03-22 16:45:10,395 INFO solr.SolrIndexWriter - Adding 129 documents
2019-03-22 16:45:10,466 WARN mapred.LocalJobRunner - job_local505251134_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=jp.or.nhk.www:http] multiple values encountered for non multiValued field meta_description: [NHK??????????????????????????????????????????????????NHK???????????????????, Japanese public broadcaster's official website with online news, profile, and press releases.]
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=jp.or.nhk.www:http] multiple values encountered for non multiValued field meta_description: [NHK??????????????????????????????????????????????????NHK???????????????????, Japanese public broadcaster's official website with online news, profile, and press releases.]
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:97)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:114)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-03-22 16:45:11,278 ERROR indexer.IndexingJob - SolrIndexerJob: java.lang.RuntimeException: job failed: name=apache-nutch-2.3.1.jar, jobid=job_local505251134_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
I tried restarting the database as suggested here, but that did not resolve the error.
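The generic "job failed" message hides the underlying cause, which only appears in the `Caused by:` line of the log: Solr rejected document `jp.or.nhk.www:http` because it received two values for `meta_description`, a field the error describes as non multiValued. With 499 URLs the log gets long, so below is a minimal sketch in Python that pulls every such document/field pair out of a log. The function name `find_multivalue_errors` and the idea of passing the log path on the command line are my own assumptions for illustration; they are not part of Nutch or Solr.

```python
import re
import sys

# Matches the Solr error text exactly as it appears in the log above.
PATTERN = re.compile(
    r"ERROR: \[doc=(?P<doc>[^\]]+)\] multiple values encountered "
    r"for non multiValued field (?P<field>\w+):"
)


def find_multivalue_errors(log_text):
    """Return unique (doc id, field name) pairs in order of first appearance.

    Hypothetical helper for illustration; not part of Nutch or Solr.
    """
    seen = []
    for match in PATTERN.finditer(log_text):
        pair = (match.group("doc"), match.group("field"))
        if pair not in seen:  # the same error is logged twice per failure
            seen.append(pair)
    return seen


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Path to the Nutch log is supplied by the user (assumption).
        with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
            text = handle.read()
    else:
        # Fall back to a shortened copy of the line from the log above.
        text = (
            "Caused by: RemoteSolrException: ERROR: [doc=jp.or.nhk.www:http] "
            "multiple values encountered for non multiValued field "
            "meta_description: [first value, second value]"
        )
    for doc, field in find_multivalue_errors(text):
        print(f"{doc}\t{field}")
```

Run against the full log, this should show whether `jp.or.nhk.www:http` is the only affected document or whether other pages and fields trigger the same rejection.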