Nutch 2.1 with Cassandra backend produces an error

Date: 2013-04-25 16:08:41

Tags: cassandra nutch gora

I chose Cassandra as the backend and started experimenting with Nutch.

With a small subset of the DMOZ URLs (~50k), every step (inject, generate, fetch) ran fine.

However, after injecting the entire DMOZ URL set (~3.5M) and trying to generate a fetchlist, I get the following error, which is reproducible on another system:

~/software/nutch_dmoz/local$ ./bin/nutch generate -topN 1000
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: topN: 1000
GeneratorJob: java.lang.RuntimeException: job failed: name=generate: 1366905487-307733671, jobid=job_local_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:191)
    at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:213)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:241)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:249)

logs/hadoop.log:

2013-04-25 17:58:07,986 INFO  crawl.GeneratorJob - GeneratorJob: Selecting best-scoring urls due for fetch.
2013-04-25 17:58:08,007 INFO  crawl.GeneratorJob - GeneratorJob: starting
2013-04-25 17:58:08,007 INFO  crawl.GeneratorJob - GeneratorJob: filtering: true
2013-04-25 17:58:08,007 INFO  crawl.GeneratorJob - GeneratorJob: topN: 1000
2013-04-25 17:58:08,570 INFO  connection.CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
2013-04-25 17:58:08,660 INFO  service.JmxMonitor - Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
2013-04-25 17:58:09,029 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-04-25 17:58:09,403 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-04-25 17:58:09,435 INFO  plugin.PluginRepository - Plugins: looking in: /home/sethunder/software/nutch_dmoz/local/plugins
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository - Registered Plugins:
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository - Registered Extension-Points:
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Parse Filter (org.apache.nutch.parse.ParseFilter)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2013-04-25 17:58:09,582 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-04-25 17:58:09,582 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-04-25 17:58:09,582 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2013-04-25 17:58:11,046 INFO  regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2013-04-25 18:01:02,936 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-04-25 18:01:02,936 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.ArrayIndexOutOfBoundsException
2013-04-25 18:01:03,412 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=generate: 1366905487-307733671, jobid=job_local_0001
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)

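As far as I can see, I have not run out of disk space: the /tmp partition has 250G free, and the partition Cassandra runs on has 2.5T free. Is there a way to increase the verbosity? I am also puzzled that the ArrayIndexOutOfBoundsException does not report which index it tried to access; it carries no message at all. The keyspace webpage exists, and I can access it with cassandra-cli. Here is the output of readdb -stats: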

~/software/nutch_dmoz/local$ ./bin/nutch readdb -stats
WebTable statistics start
Statistics for WebTable: 
min score:  55.0
retry 0:    3576393
jobs:   {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=1609, MAP_INPUT_RECORDS=3576393, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=858, MAP_OUTPUT_BYTES=189548829, COMMITTED_HEAP_BYTES=1521614848, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1010, COMBINE_INPUT_RECORDS=14305902, REDUCE_INPUT_RECORDS=114, REDUCE_INPUT_GROUPS=114, COMBINE_OUTPUT_RECORDS=444, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=114, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=14305572}, FileSystemCounters={FILE_BYTES_READ=910481, FILE_BYTES_WRITTEN=1028473}, File Output Format Counters ={BYTES_WRITTEN=2421}}}}
max score:  1.0
TOTAL urls: 3576393
status 0 (null):    3576393
avg score:  1.0
WebTable statistics: done
min score:  55.0
retry 0:    3576393
jobs:   {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=1609, MAP_INPUT_RECORDS=3576393, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=858, MAP_OUTPUT_BYTES=189548829, COMMITTED_HEAP_BYTES=1521614848, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1010, COMBINE_INPUT_RECORDS=14305902, REDUCE_INPUT_RECORDS=114, REDUCE_INPUT_GROUPS=114, COMBINE_OUTPUT_RECORDS=444, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=114, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=14305572}, FileSystemCounters={FILE_BYTES_READ=910481, FILE_BYTES_WRITTEN=1028473}, File Output Format Counters ={BYTES_WRITTEN=2421}}}}
max score:  1.0
TOTAL urls: 3576393
status 0 (null):    3576393
avg score:  1.0
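
On the verbosity question, one option worth trying is to raise the log levels in the `conf/log4j.properties` file of the Nutch runtime. The fragment below is only a sketch, assuming the stock log4j 1.x configuration that ships with Nutch 2.1. The logger names are package prefixes taken from the stack trace and log above; whether they emit anything useful at DEBUG for this particular failure has not been verified.

```properties
# Sketch: add to conf/log4j.properties (assumed location in the local runtime).
# If a line for the same logger already exists, edit that line instead.
log4j.logger.org.apache.nutch=DEBUG
log4j.logger.org.apache.gora=DEBUG
log4j.logger.org.apache.hadoop.mapred=DEBUG
```

The bare `java.lang.ArrayIndexOutOfBoundsException` with neither message nor stack trace may be the HotSpot JVM reusing a preallocated exception for an error thrown repeatedly from compiled code. If that is what happens here, passing `-XX:-OmitStackTraceInFastThrow` to the JVM should restore the trace. This is an assumption about the cause, not something the log confirms.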

0 answers:

No answers