I am new to Nutch. I have installed Nutch 2.3.1 and configured it to use MongoDB as the storage backend. The inject step succeeds, but the generate step throws the exception shown in the log below. Note: the error occurs with a seed file containing 60K URLs; when I tried with only 100 URLs, everything went fine.
Do you have any idea what is causing this error? Thanks!
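For reference, the failure happens inside Gora's MongoDB cursor iteration (see the stack trace further down). A standalone scan with the same legacy 2.x driver API is a quick way to check whether the 60K-document read also fails outside of Nutch. This is only a sketch, assuming the default webpage collection in the nutch database (a crawl id would prefix the collection name):

import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.MongoClient;

// Standalone scan of the Nutch webpage collection using the same legacy
// driver API (DBCursor.hasNext/next) that appears in the stack trace.
// The collection name "webpage" is an assumption; adjust if your crawl id prefixes it.
public class WebpageScan {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        try {
            DBCollection webpage = client.getDB("nutch").getCollection("webpage");
            DBCursor cursor = webpage.find().batchSize(100); // small batches force getMore calls
            long count = 0;
            while (cursor.hasNext()) { // this is where the EOFException surfaces in Nutch
                cursor.next();
                count++;
            }
            System.out.println("Scanned " + count + " documents without error");
        } finally {
            client.close();
        }
    }
}

The complete generate log: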
2016-12-30 00:01:48,446 INFO crawl.GeneratorJob - GeneratorJob: starting at 2016-12-30 00:01:48
2016-12-30 00:01:48,447 INFO crawl.GeneratorJob - GeneratorJob: Selecting best-scoring urls due for fetch.
2016-12-30 00:01:48,447 INFO crawl.GeneratorJob - GeneratorJob: starting
2016-12-30 00:01:48,448 INFO crawl.GeneratorJob - GeneratorJob: filtering: true
2016-12-30 00:01:48,448 INFO crawl.GeneratorJob - GeneratorJob: normalizing: true
2016-12-30 00:01:48,448 INFO crawl.GeneratorJob - GeneratorJob: topN: 100000
2016-12-30 00:01:48,816 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-12-30 00:01:48,857 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2016-12-30 00:01:48,867 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2016-12-30 00:01:48,867 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2016-12-30 00:01:51,568 WARN conf.Configuration - file:/tmp/hadoop-mehdi/mapred/staging/mehdi1740651658/.staging/job_local1740651658_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-30 00:01:51,573 WARN conf.Configuration - file:/tmp/hadoop-mehdi/mapred/staging/mehdi1740651658/.staging/job_local1740651658_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-30 00:01:51,753 WARN conf.Configuration - file:/tmp/hadoop-mehdi/mapred/local/localRunner/mehdi/job_local1740651658_0001/job_local1740651658_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-30 00:01:51,760 WARN conf.Configuration - file:/tmp/hadoop-mehdi/mapred/local/localRunner/mehdi/job_local1740651658_0001/job_local1740651658_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-30 00:01:52,408 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2016-12-30 00:01:52,408 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2016-12-30 00:01:52,408 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2016-12-30 00:01:52,591 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2016-12-30 00:02:03,229 ERROR mapreduce.GoraRecordReader - Error reading Gora records: Read operation to server localhost:27017 failed on database nutch
2016-12-30 00:02:04,607 WARN mapred.LocalJobRunner - job_local1740651658_0001
java.lang.Exception: java.lang.RuntimeException: com.mongodb.MongoException$Network: Read operation to server localhost:27017 failed on database nutch
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.RuntimeException: com.mongodb.MongoException$Network: Read operation to server localhost:27017 failed on database nutch
at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:122)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.mongodb.MongoException$Network: Read operation to server localhost:27017 failed on database nutch
at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:298)
at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:269)
at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:235)
at com.mongodb.QueryResultIterator.getMore(QueryResultIterator.java:145)
at com.mongodb.QueryResultIterator.hasNext(QueryResultIterator.java:135)
at com.mongodb.DBCursor._hasNext(DBCursor.java:626)
at com.mongodb.DBCursor.hasNext(DBCursor.java:657)
at org.apache.gora.mongodb.query.MongoDBResult.nextInner(MongoDBResult.java:71)
at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:111)
at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:118)
... 12 more
Caused by: java.io.EOFException
at org.bson.io.Bits.readFully(Bits.java:75)
at org.bson.io.Bits.readFully(Bits.java:50)
at org.bson.io.Bits.readFully(Bits.java:37)
at com.mongodb.Response.<init>(Response.java:42)
at com.mongodb.DBPort$1.execute(DBPort.java:164)
at com.mongodb.DBPort$1.execute(DBPort.java:158)
at com.mongodb.DBPort.doOperation(DBPort.java:187)
at com.mongodb.DBPort.call(DBPort.java:158)
at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:290)
... 21 more
2016-12-30 00:02:04,846 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=nutch-maven-1.0-SNAPSHOT.jar, jobid=job_local1740651658_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:227)
at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:256)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:322)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:330)
Answer 0 (score: 1)
I found that the problem comes from the MongoDB version. Nutch uses mongo-java-driver-2.13.1.jar and I had installed MongoDB 3.4.1. So I installed MongoDB 2.6.7 instead, and now it works fine. I will try to update the driver in Nutch and let you know whether it works with the newer MongoDB version.
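As a quick way to confirm the driver/server pairing before or after changing anything, a small check like the one below prints both versions. This is only a sketch under a couple of assumptions: the driver jar's manifest carries an Implementation-Version entry, and the server is reachable at localhost:27017 as in the log.

import com.mongodb.CommandResult;
import com.mongodb.Mongo;
import com.mongodb.MongoClient;

// Prints the client driver version (from the jar manifest, if present)
// and the server version reported by MongoDB's buildInfo command, to
// confirm whether an old 2.x driver is talking to a 3.x server.
public class VersionCheck {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        try {
            String driverVersion = Mongo.class.getPackage().getImplementationVersion();
            CommandResult buildInfo = client.getDB("nutch").command("buildInfo");
            System.out.println("mongo-java-driver: " + driverVersion);
            System.out.println("mongod server:     " + buildInfo.getString("version"));
        } finally {
            client.close();
        }
    }
}

If the versions confirm the mismatch, downgrading the server (as above) or upgrading the driver are the two options; note that the bundled Gora MongoDB store is written against the legacy 2.x driver API (DB/DBCursor in the trace), so upgrading the driver may require more than swapping the jar.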