Runtime exception when running Nutch generate

Asked: 2016-12-29 23:27:19

Tags: mongodb nutch

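I am new to Nutch. I have installed Nutch 2.3.1 and configured it to use MongoDB. The inject step succeeds, but the generate step throws an exception (see below). Note: the error occurs with a seed file containing 60K URLs. When I tried a seed file with 100 URLs, everything worked fine.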

Do you know what causes this error? Thanks!

2016-12-30 00:01:48,446 INFO  crawl.GeneratorJob - GeneratorJob: starting at 2016-12-30 00:01:48
2016-12-30 00:01:48,447 INFO  crawl.GeneratorJob - GeneratorJob: Selecting best-scoring urls due for fetch.
2016-12-30 00:01:48,447 INFO  crawl.GeneratorJob - GeneratorJob: starting
2016-12-30 00:01:48,448 INFO  crawl.GeneratorJob - GeneratorJob: filtering: true
2016-12-30 00:01:48,448 INFO  crawl.GeneratorJob - GeneratorJob: normalizing: true
2016-12-30 00:01:48,448 INFO  crawl.GeneratorJob - GeneratorJob: topN: 100000
2016-12-30 00:01:48,816 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-12-30 00:01:48,857 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2016-12-30 00:01:48,867 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2016-12-30 00:01:48,867 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2016-12-30 00:01:51,568 WARN  conf.Configuration - file:/tmp/hadoop-mehdi/mapred/staging/mehdi1740651658/.staging/job_local1740651658_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-12-30 00:01:51,573 WARN  conf.Configuration - file:/tmp/hadoop-mehdi/mapred/staging/mehdi1740651658/.staging/job_local1740651658_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-12-30 00:01:51,753 WARN  conf.Configuration - file:/tmp/hadoop-mehdi/mapred/local/localRunner/mehdi/job_local1740651658_0001/job_local1740651658_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-12-30 00:01:51,760 WARN  conf.Configuration - file:/tmp/hadoop-mehdi/mapred/local/localRunner/mehdi/job_local1740651658_0001/job_local1740651658_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-12-30 00:01:52,408 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2016-12-30 00:01:52,408 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2016-12-30 00:01:52,408 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2016-12-30 00:01:52,591 INFO  regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2016-12-30 00:02:03,229 ERROR mapreduce.GoraRecordReader - Error reading Gora records: Read operation to server localhost:27017 failed on database nutch
2016-12-30 00:02:04,607 WARN  mapred.LocalJobRunner - job_local1740651658_0001
java.lang.Exception: java.lang.RuntimeException: com.mongodb.MongoException$Network: Read operation to server localhost:27017 failed on database nutch
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.RuntimeException: com.mongodb.MongoException$Network: Read operation to server localhost:27017 failed on database nutch
    at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:122)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.mongodb.MongoException$Network: Read operation to server localhost:27017 failed on database nutch
    at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:298)
    at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:269)
    at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:235)
    at com.mongodb.QueryResultIterator.getMore(QueryResultIterator.java:145)
    at com.mongodb.QueryResultIterator.hasNext(QueryResultIterator.java:135)
    at com.mongodb.DBCursor._hasNext(DBCursor.java:626)
    at com.mongodb.DBCursor.hasNext(DBCursor.java:657)
    at org.apache.gora.mongodb.query.MongoDBResult.nextInner(MongoDBResult.java:71)
    at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:111)
    at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:118)
    ... 12 more
Caused by: java.io.EOFException
    at org.bson.io.Bits.readFully(Bits.java:75)
    at org.bson.io.Bits.readFully(Bits.java:50)
    at org.bson.io.Bits.readFully(Bits.java:37)
    at com.mongodb.Response.<init>(Response.java:42)
    at com.mongodb.DBPort$1.execute(DBPort.java:164)
    at com.mongodb.DBPort$1.execute(DBPort.java:158)
    at com.mongodb.DBPort.doOperation(DBPort.java:187)
    at com.mongodb.DBPort.call(DBPort.java:158)
    at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:290)
    ... 21 more
2016-12-30 00:02:04,846 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=nutch-maven-1.0-SNAPSHOT.jar, jobid=job_local1740651658_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:227)
    at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:256)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:322)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:330)

1 answer:

Answer 0 (score: 1)

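I found that the problem came from the MongoDB version. Nutch uses mongo-java-driver-2.13.1.jar and I had installed MongoDB 3.4.1. So I installed MongoDB 2.6.7 instead, and now it works fine. I will try updating the driver in Nutch and let you know whether it works with newer versions of MongoDB.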
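
To spot this kind of driver/server mismatch before running a crawl, a small check along these lines might help. It is only a sketch: the function names are hypothetical, and the rule "flag the setup when the server's major version is higher than the driver's" is a rough heuristic assumed here for illustration, not MongoDB's official compatibility matrix.

```python
import re


def parse_version(text):
    """Extract the first dotted version number (e.g. '2.13.1') as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def server_newer_than_driver(driver_jar, server_version):
    """Return True when the server's major version exceeds the driver's.

    Assumption: a higher server major version is treated as a warning sign.
    This is a heuristic, not an official compatibility rule.
    """
    driver = parse_version(driver_jar)
    server = parse_version(server_version)
    return server[0] > driver[0]


# Versions taken from the answer above.
print(server_newer_than_driver("mongo-java-driver-2.13.1.jar", "3.4.1"))
print(server_newer_than_driver("mongo-java-driver-2.13.1.jar", "2.6.7"))
```

With the versions from this answer, the first call flags the 3.4.1 server and the second does not flag 2.6.7. For a real decision, check the driver compatibility table in the MongoDB documentation.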