I am trying to run the following script from https://github.com/chrismattmann/nutch-python/wiki:
from nutch.nutch import Nutch
from nutch.nutch import SeedClient
from nutch.nutch import Server
from nutch.nutch import JobClient
import nutch
sv=Server('http://localhost:8081')
sc=SeedClient(sv)
seed_urls = ['http://www.ideaeng.com/nutch-ioexception-error-0506']
sd= sc.create('demo',seed_urls)
nt = Nutch('default')
jc = JobClient(sv, 'test1', 'default')
cc = nt.Crawl(sd, sc, jc)
while True:
    job = cc.progress()  # gets the current job if no progress, else iterates and makes progress
    if job is None:
        break
When I run the script above, all the other jobs complete fine, but it fails at the indexing step and the job state comes back as "FAILED".
Error:
nutch.py: GET Endpoint: /job/test1-default-INDEX-810840320
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
************* URL ******************* : http://localhost:8081/job/test1-default-INDEX-810840320
************* Data ******************* : {}
************* Headers ******************* : {'Accept': 'application/json'}
nutch.py: Response headers: {'Date': 'Thu, 19 Jul 2018 15:39:47 GMT', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/json', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {u'crawlId': u'test1', u'args': {u'url_dir': u'seedFiles/seed-1532014777701'}, u'state': u'FAILED', u'result': None, u'msg': u'ERROR: java.io.IOException: Job failed!', u'type': u'INDEX', u'id': u'test1-default-INDEX-810840320', u'confId': u'default'}
@@@@@@@@@@@@@@@@@@@@@@@ Job STATE @@@@@@@@@@@@@@@@ : FAILED
Traceback (most recent call last):
File "test.py", line 17, in <module>
job = cc.progress() # gets the current job if no progress, else iterates and makes progress
File "/home/purushottam/Documents/tech_learn/ex_nutch/nutch-python/nutch/nutch.py", line 570, in progress
raise NutchCrawlException
nutch.nutch.NutchCrawlException
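For context, the FAILED state is already visible in the raw job JSON before `cc.progress()` raises `NutchCrawlException`; a minimal stdlib-only sketch that inspects the response shown in the log above (the JSON is copied verbatim from it):

```python
import json

# Job-status JSON as returned by GET /job/<id> (copied from the nutch.py log above)
raw = """{"crawlId": "test1", "args": {"url_dir": "seedFiles/seed-1532014777701"},
"state": "FAILED", "result": null,
"msg": "ERROR: java.io.IOException: Job failed!",
"type": "INDEX", "id": "test1-default-INDEX-810840320", "confId": "default"}"""

job = json.loads(raw)
if job["state"] == "FAILED":
    # Surface the server-side message instead of a bare NutchCrawlException
    print("{} job {} failed: {}".format(job["type"], job["id"], job["msg"]))
```

So the only detail the Python side reports is the generic `java.io.IOException: Job failed!`; the real cause is in hadoop.log below.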
hadoop.log file:
2018-07-19 21:09:46,278 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: test1/crawldb
2018-07-19 21:09:46,278 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: test1/segments/20180719204031
2018-07-19 21:09:46,282 WARN  mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2018-07-19 21:09:46,285 WARN  mapreduce.JobResourceUploader - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2018-07-19 21:09:46,644 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-07-19 21:09:46,674 INFO  solr.SolrMappingReader - source: content dest: content
2018-07-19 21:09:46,674 INFO  solr.SolrMappingReader - source: title dest: title
2018-07-19 21:09:46,674 INFO  solr.SolrMappingReader - source: host dest: host
2018-07-19 21:09:46,674 INFO  solr.SolrMappingReader - source: segment dest: segment
2018-07-19 21:09:46,674 INFO  solr.SolrMappingReader - source: boost dest: boost
2018-07-19 21:09:46,674 INFO  solr.SolrMappingReader - source: digest dest: digest
2018-07-19 21:09:46,674 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2018-07-19 21:09:46,741 WARN  mapred.LocalJobRunner - job_local500026311_0064
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html. Error 404 Not Found
Problem accessing /solr/update. Reason:
Not Found
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre> Not Found</pre></p>
</body>
</html>
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:544)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:191)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-07-19 21:09:47,421 ERROR impl.JobWorker - Cannot run job worker!
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:96)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:89)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:351)
at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:73)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
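For reference, the SolrMappingReader lines in the log correspond to Nutch's conf/solrindex-mapping.xml; on a stock install, the default mapping that produces those lines looks roughly like this (a sketch of the default file, shown only to confirm the mapping itself appears unmodified):

```xml
<mapping>
  <fields>
    <field dest="content" source="content"/>
    <field dest="title" source="title"/>
    <field dest="host" source="host"/>
    <field dest="segment" source="segment"/>
    <field dest="boost" source="boost"/>
    <field dest="digest" source="digest"/>
    <field dest="tstamp" source="tstamp"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```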