Question

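I am using Solr 3.6 and Tika 1.2, but I cannot upload PDF files. First I installed Solr and uploaded some *.xml files from exampledocs. I can search these files with this URL: http://localhost:8983/solr/select/?q=solr. In the next step I installed Tika to upload PDF and DOC files, but it does not work. The following is in the file "example/solr/conf/solrconfig.xml".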

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="tika.config">tika-data-config.xml</str>
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

In the file "example/solr/conf/tika-data-config.xml" I have this content:

<dataConfig>
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="false" processor="FileListEntityProcessor" transformer="TemplateTransformer" baseDir="/home/ubuntu-user/Documents" fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)" onError="skip" recursive="true">
      <field column="fileAbsolutePath" name="path" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastmodified" />
      <entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text" onError="skip">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
      </entity>
    </entity>
  </document>
</dataConfig>

If I run this command in the console

curl "http://localhost:8983/solr/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@test.pdf"

I get this output

<?xml version="1.0" encoding="UTF-8"?>
  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">183</int>
    </lst>
  </response>

But I cannot find the content when searching with Solr. If I browse to http://localhost:8983/solr/browse, I see a new entry, but it has no content.

I also started the Solr and Tika servers:

java -jar start.jar
java -jar tika-server-1.2.jar

Can anyone help me?

Answer 1

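You need to add the jars (or the paths to them) for apache-solr-dataimporthandler-3.6, apache-solr-dataimporthandler-extras-3.6 and apache-solr-cell-3.6 from the dist folder, along with the corresponding files in the contrib folder.

Then you can extract PDFs from within Solr without starting the Tika server.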
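
As a rough sketch of what that could look like in solrconfig.xml: the lib lines below assume the stock example layout, where dist and contrib sit two levels above the core's instance directory. The /dataimport handler name is only the conventional choice and is an assumption, not something stated in the question. The config value points at the tika-data-config.xml from the question, which is a DataImportHandler configuration rather than a Tika one.

```xml
<!-- Assumed layout: paths are relative to example/solr, as in the stock example -->
<!-- Matches both the dataimporthandler and dataimporthandler-extras jars -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
<!-- Solr Cell plus the Tika jars it depends on -->
<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />

<!-- Hypothetical handler registration so the DIH config file is actually used -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">tika-data-config.xml</str>
  </lst>
</requestHandler>
```

With a handler registered this way, an import would typically be triggered through http://localhost:8983/solr/dataimport?command=full-import rather than through /update/extract.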

Answer 2

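Check the ExtractingRequestHandler, which helps you index rich documents. You do not need to start a separate Tika server, because Solr can use the added libraries to extract content from rich documents.

The required jars (Solr Cell and the Tika jars it depends on) are probably referenced in the configuration like this: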

<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" /> 
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
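
For comparison, a handler definition along the lines of the one in the question, but without the tika.config entry, might look like the sketch below. The reasoning is that tika.config expects a Tika configuration file, whereas tika-data-config.xml in the question is a DataImportHandler configuration, so leaving the parameter out lets Tika fall back to its defaults. Whether that entry is what caused the empty content in this case is not confirmed by the thread.

```xml
<!-- Sketch only: same defaults as in the question, minus tika.config -->
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Map the extracted body text to the "text" field -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <!-- Prefix for extracted fields that are not defined in the schema -->
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
```

Note that the curl command in the question overrides fmap.content with attr_content and uprefix with attr_, so with that request the extracted text lands in attr_content rather than in text.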

Answer 3

I have now reinstalled Solr from scratch, and I can search the PDFs with this URL:

http://localhost:8983/solr/select/?q=attr_content:st*

Some PDFs work, but for some PDFs I get this output:

<arr name="attr_content"><str>                         ((stdin))      � ���������

attr_creation_date and attr_meta are fine. The producer is Ghostscript: GPL Ghostscript 8.63.
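
Since the extracted text is stored in attr_content here, a plain q=solr query against the default search field will not see it. One possible way to change that is a copyField rule, sketched under the assumption that schema.xml is based on the stock Solr 3.6 example, which is assumed to define an attr_* dynamic field and a catch-all text field used as the default search field:

```xml
<!-- schema.xml: assumed to be derived from the stock example schema -->
<!-- Dynamic field that catches attr_content, attr_creation_date, attr_meta, ... -->
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>

<!-- Copy the extracted body into the default search field so q=solr can match it -->
<copyField source="attr_content" dest="text"/>
```

Documents have to be re-indexed after a schema change. This does not address the garbled output from the Ghostscript-produced PDFs.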

Configuring Apache Solr 3.6 with Tika 1.2
