Question

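I have started using Apache Nutch for crawling, following the steps in the Nutch tutorial on the Apache wiki. I was able to set up a Solr server on port 8983 as described there. I am now trying to index with that tooling, but I get the following error: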

Indexer: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>

</body>
</html>

at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:512)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:146)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:164)
at org.apache.nutch.indexer.IndexWriters.commit(IndexWriters.java:125)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:149)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)

This is my first time using Solr, so any help would be appreciated; none of the other solutions I found worked for me.

Answer 1

The most likely problem is /solr/update. Recent versions of Solr no longer support a default collection, so the path is missing a name between /solr and /update.

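So if you are using a recent (5.x) Solr, the URL needs to include the name of the collection you created. Check the Nutch tutorial or documentation for how to supply an explicit collection name in the URL.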
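To make the difference concrete, here is a minimal sketch of how the update endpoint changes once a collection name is required. The helper name `solr_update_url` and the collection name `my_collection` are illustrative assumptions, not part of Nutch or Solr.

```shell
#!/bin/sh
# Hypothetical helper: build the update endpoint for a named collection/core.
# $1 = Solr base URL (e.g. http://localhost:8983/solr), $2 = collection name.
solr_update_url() {
    base=${1%/}            # drop one trailing slash, if present
    printf '%s/%s/update\n' "$base" "$2"
}

# Path from the error message, with no collection name:
#   http://localhost:8983/solr/update
# Path with an explicit collection name (my_collection is a placeholder):
solr_update_url "http://localhost:8983/solr" "my_collection"
```

Whatever name you pass must match a collection or core that actually exists on your Solr instance.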

Answer 2

I ran into the same error with Apache Nutch 1.11 and Apache Solr 5.3.1. Including the core name (test_core in the example below) in solr.server.url solved the problem:

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/test_core urls/ TestCrawl/  2
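Before starting a crawl, it may help to confirm that the core URL responds at all. The following is a rough sketch, not an official Nutch or Solr tool: the helper names `diagnose_status` and `check_core` are made up for illustration, and it assumes `curl` is installed and that the core serves the usual `/select` handler.

```shell
#!/bin/sh
# Assumed value: adjust host, port, and core name to your own setup.
SOLR_URL="http://localhost:8983/solr/test_core"

# Hypothetical helper: map an HTTP status code to a short diagnosis.
diagnose_status() {
    case "$1" in
        200) echo "ok" ;;
        404) echo "not found: check the core name in solr.server.url" ;;
        000) echo "no response: is Solr running?" ;;
        *)   echo "unexpected status $1" ;;
    esac
}

# Hypothetical helper: query the core with rows=0 and report the result.
# curl prints 000 as the status code when it cannot connect.
check_core() {
    code=$(curl -s -o /dev/null -w '%{http_code}' "$1/select?q=*:*&rows=0")
    diagnose_status "$code"
}

# Not run automatically; call it by hand once Solr is up:
#   check_core "$SOLR_URL"
```

If the check reports "not found", the core name in the URL probably does not match any core on the server, which is the same situation that produces the 404 in the question.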
