获取html页面并解析它需要什么样的SOLR配置?

时间:2015-12-19 02:44:41

标签: solr

我一直在咨询一个又一个教程,并花了大量时间进行搜索。

我从头开始安装SOLR并启动它。

bin/solr start

我成功导航到SOLR管理员。然后我创建了一个新核心。

bin/solr create -c core_wiki -d basic_configs

我查看了bin/post命令的帮助。

bin/post -h

...
* Web crawl: bin/post -c gettingstarted http://lucene.apache.org/solr -recursive 1 -delay 1
...

所以我尝试进行类似的调用......但我一直收到FileNotFound错误。

bin/post -c core_wiki http://localhost:8983/solr/ -recursive 1 -delay 10

/usr/lib/jvm/java-7-openjdk-amd64/jre//bin/java -classpath /home/ubuntu/src/solr-5.4.0/dist/solr-core-5.4.0.jar -Dauto=yes -Drecursive=1 -Ddelay=10 -Dc=core_wiki -Ddata=web org.apache.solr.util.SimplePostTool http://localhost:8983/solr/
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/core_wiki/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
Entering recursive mode, depth=1, delay=10s
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/core_wiki/update/extract?literal.id=http%3A%2F%2Flocalhost%3A8983%2Fsolr&literal.url=http%3A%2F%2Flocalhost%3A8983%2Fsolr
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/core_wiki/update/extract. Reason:
<pre>    Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>

</body>
</html>
SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: http://localhost:8983/solr/core_wiki/update/extract?literal.id=http%3A%2F%2Flocalhost%3A8983%2Fsolr&literal.url=http%3A%2F%2Flocalhost%3A8983%2Fsolr
SimplePostTool: WARNING: An error occurred while posting http://localhost:8983/solr
0 web pages indexed.
COMMITting Solr index changes to http://localhost:8983/solr/core_wiki/update/extract...
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/core_wiki/update/extract?commit=true
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/core_wiki/update/extract. Reason:
<pre>    Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>

</body>
</html>
Time spent: 0:00:00.041

我仍然是SOLR索引的新手。任何能够指出我正确方向的提示都会受到赞赏。

1 个答案:

答案 0 :(得分:0)

您的配置中似乎缺少名为/update/extract的请求处理程序。

  

ExtractingRequestHandler 未纳入solr战争   文件,它作为SolrPlugins提供,你必须加载它(和   这是依赖关系)明确。 (Apache Solr Wiki

应在 solrconfig.xml 中定义,例如:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">