Error: with the Nutch Python lib, all jobs run fine but the index job fails

Asked: 2018-07-19 16:09:06

Tags: python solr nutch

I am trying to run the following script, taken from https://github.com/chrismattmann/nutch-python/wiki:

from nutch.nutch import Nutch
from nutch.nutch import SeedClient
from nutch.nutch import Server
from nutch.nutch import JobClient
import nutch

sv = Server('http://localhost:8081')  # Nutch REST server
sc = SeedClient(sv)
seed_urls = ('http://www.ideaeng.com/nutch-ioexception-error-0506',)  # trailing comma: a one-element tuple, not a string
sd = sc.create('demo', seed_urls)

nt = Nutch('default')
jc = JobClient(sv, 'test1', 'default')

cc = nt.Crawl(sd, sc, jc)
while True:
    job = cc.progress()  # gets the current job if no progress, else iterates and makes progress
    if job is None:
        break

When I run the script above, every other job completes, but the INDEX job fails and its job state comes back as FAILED.

Error: 
nutch.py: GET Endpoint: /job/test1-default-INDEX-810840320
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
************* URL ******************* :  http://localhost:8081/job/test1-default-INDEX-810840320
************* Data ******************* :  {}
************* Headers ******************* :  {'Accept': 'application/json'}
nutch.py: Response headers: {'Date': 'Thu, 19 Jul 2018 15:39:47 GMT', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/json', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {u'crawlId': u'test1', u'args': {u'url_dir': u'seedFiles/seed-1532014777701'}, u'state': u'FAILED', u'result': None, u'msg': u'ERROR: java.io.IOException: Job failed!', u'type': u'INDEX', u'id': u'test1-default-INDEX-810840320', u'confId': u'default'}
@@@@@@@@@@@@@@@@@@@@@@@ Job STATE @@@@@@@@@@@@@@@@ :  FAILED
Traceback (most recent call last):
  File "test.py", line 17, in <module>
    job = cc.progress() # gets the current job if no progress, else iterates and makes progress
  File "/home/purushottam/Documents/tech_learn/ex_nutch/nutch-python/nutch/nutch.py", line 570, in progress
    raise NutchCrawlException
nutch.nutch.NutchCrawlException

hadoop.log:
2018-07-19 21:09:46,278 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: test1/crawldb
2018-07-19 21:09:46,278 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: test1/segments/20180719204031
2018-07-19 21:09:46,282 WARN  mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2018-07-19 21:09:46,285 WARN  mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2018-07-19 21:09:46,644 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-07-19 21:09:46,674 INFO  solr.SolrMappingReader - source: content dest: content
2018-07-19 21:09:46,674 INFO  solr.SolrMappingReader - source: title dest: title
2018-07-19 21:09:46,674 INFO  solr.SolrMappingReader - source: host dest: host
2018-07-19 21:09:46,674 INFO  solr.SolrMappingReader - source: segment dest: segment
2018-07-19 21:09:46,674 INFO  solr.SolrMappingReader - source: boost dest: boost
2018-07-19 21:09:46,674 INFO  solr.SolrMappingReader - source: digest dest: digest
2018-07-19 21:09:46,674 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2018-07-19 21:09:46,741 WARN  mapred.LocalJobRunner - job_local500026311_0064
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p>
</body>
</html>

    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p>
</body>
</html>

    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:544)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:191)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2018-07-19 21:09:47,421 ERROR impl.JobWorker - Cannot run job worker!
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:96)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:89)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:351)
    at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:73)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
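
The log shows that the 404 comes back for /solr/update, meaning the indexer is committing against the Solr root http://127.0.0.1:8983/solr with no core or collection name in the path. On Solr installations where the update handler is served per core (/solr/<core>/update), that root-level path does not exist. Whether that applies to this setup is an assumption, and the core name `nutch` below is hypothetical. As a minimal sketch, this check reports whether a configured Solr URL carries a segment after /solr:

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2, which the u'...' output above suggests


def has_core_segment(solr_url):
    """Return True if the URL path has at least one segment after 'solr'."""
    # Split the path into non-empty segments, e.g. '/solr/nutch' -> ['solr', 'nutch']
    segments = [s for s in urlsplit(solr_url).path.split('/') if s]
    return len(segments) >= 2 and segments[0] == 'solr'


# URL taken from the log above: no segment after /solr
print(has_core_segment('http://127.0.0.1:8983/solr'))
# Hypothetical URL that includes a core named 'nutch'
print(has_core_segment('http://127.0.0.1:8983/solr/nutch'))
```

If the first check is False for the URL that Nutch is configured with, the Solr URL setting is a reasonable place to look next.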

0 answers:

No answers