Nutch 1.13 fetch of URL failed: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url = http

Date: 2017-08-31 14:11:23

Tags: solr, centos, nutch


    fetch of http URL failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url = http
        at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:85)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:285)


    Using queue mode: byHost
    fetch of https URL failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url = https
        at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:85)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:285)

I am using Solr 6.6.0.

I get the above errors when running Nutch 1.13. The command I used is:


    bin/crawl -i -D solr.server.url=http://myip/solr/nutch/ urls/ crawl 2

Below is the plugin section of my nutch-site.xml:

  <name>plugin.includes</name>
  <value>
protocol-(http|httpclient)|urlfilter-regex|parse-(html)|index-(basic|anchor)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
</value>

Below is the listing of my plugins folder:

    [root@localhost apache-nutch-1.13]# ls plugins
creativecommons      index-more           nutch-extensionpoints   protocol-file                 scoring-similarity         urlnormalizer-ajax
feed                 index-replace        parse-ext               protocol-ftp                  subcollection              urlnormalizer-basic
headings             index-static         parsefilter-naivebayes  protocol-htmlunit             tld                        urlnormalizer-host
index-anchor         language-identifier  parsefilter-regex       protocol-http                 urlfilter-automaton        urlnormalizer-pass
index-basic          lib-htmlunit         parse-html              protocol-httpclient           urlfilter-domain           urlnormalizer-protocol
indexer-cloudsearch  lib-http             parse-js                protocol-interactiveselenium  urlfilter-domainblacklist  urlnormalizer-querystring
indexer-dummy        lib-nekohtml         parse-metatags          protocol-selenium             urlfilter-ignoreexempt     urlnormalizer-regex
indexer-elastic      lib-regex-filter     parse-replace           publish-rabbitmq              urlfilter-prefix           urlnormalizer-slash
indexer-solr         lib-selenium         parse-swf               publish-rabitmq               urlfilter-regex
index-geoip          lib-xml              parse-tika              scoring-depth                 urlfilter-suffix
index-links          microformats-reltag  parse-zip               scoring-link                  urlfilter-validator
index-metadata       mimetype-filter      plugin                  scoring-opic                  urlmeta

I am stuck on this problem. As you can see, I have included both protocols, protocol-(http|httpclient), but fetching URLs still fails. Thanks in advance.

NEWER ISSUE hadoop.log


    2017-09-01 14:35:07,172 INFO  solr.SolrIndexWriter - SolrIndexer: deleting 1/1 documents
    2017-09-01 14:35:07,321 WARN  output.FileOutputCommitter - Output Path is null in cleanupJob()
    2017-09-01 14:35:07,323 WARN  mapred.LocalJobRunner - job_local1176811933_0001
    java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
    Caused by: java.lang.IllegalStateException: Connection pool shut down
        at org.apache.http.util.Asserts.check(Asserts.java:34)
        at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
        at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
        at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:481)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
        at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
        at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:191)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
        at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:122)
        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    2017-09-01 14:35:07,679 ERROR indexer.CleaningJob - CleaningJob: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
        at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:174)
        at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:197)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.CleaningJob.main(CleaningJob.java:208)

1 Answer:

Answer 0 (score: 1)

I managed to solve the issue. I believe the whitespace inside the plugin.includes value in nutch-site.xml caused it. Posting the new plugin.includes section for others who end up here.

    <name>plugin.includes</name>
    <value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(html)|index-(basic|anchor)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
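Why the whitespace breaks fetching: Nutch filters plugins by matching each plugin id against the plugin.includes value treated as a regular expression, so a `<value>` element that spans multiple lines carries a leading and trailing newline into the pattern, and ids such as `protocol-http` no longer match. A minimal Python sketch of this effect (the matching logic is only an approximation of Nutch's plugin-id filtering, not its actual code):

```python
import re

# plugin.includes as in the question: the <value> element spans multiple
# lines, so the regex carries a leading and a trailing newline.
broken = "\nprotocol-(http|httpclient)|urlfilter-regex|parse-(html)\n"

# The fixed value from the answer: everything on one line, no stray whitespace.
fixed = "protocol-http|protocol-httpclient|urlfilter-regex|parse-(html)"

# The newlines are glued onto the first and last alternatives, so the
# protocol plugin id no longer matches and no protocol implementation is found.
print(re.fullmatch(broken, "protocol-http"))              # None
print(re.fullmatch(fixed, "protocol-http") is not None)   # True
```

In other words, the fix is not which plugins are listed but keeping the whole value on a single line with no surrounding whitespace.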