Extracting PDF from Apache Solr

Time: 2017-07-09 13:14:21

Tags: indexing solr

I am new to Solr indexing. I am using Solr 5.5 and indexed a PDF file in it by simply running

#bin/post -c gettingstarted /home/ubuntu/pdf.pdf

I have since deleted the source PDF file. Is there any way I can extract the PDF file from Apache Solr? I can see that it is indexed from the URL

http://localhost:8983/solr/gettingstarted/select?q=*.pdf

Thanks in advance.

1 answer:

Answer 0: (score: 1)

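If the file was indexed correctly with the default settings, the PDF content is declared in the schema and indexed into a field named content. So search the content field for a keyword (or *).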

Example: q=content:keyword (where keyword is a term that appears in the PDF)

http://localhost:8983/solr/gettingstarted/select?q=content:*
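The query above can also be run from a script. The following is a minimal Python sketch, assuming the collection URL from the question and that the content field exists and is stored; the helper names (build_select_url, extract_text, fetch_text) are hypothetical and not part of Solr. As far as I know, this only recovers the text that was extracted at index time, not the original PDF binary.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Collection URL taken from the question; adjust for your own setup.
SOLR_SELECT = "http://localhost:8983/solr/gettingstarted/select"


def build_select_url(keyword="*", field="content", base=SOLR_SELECT):
    """Build a select URL that searches `field` and returns it with the doc id."""
    params = {"q": f"{field}:{keyword}", "fl": f"id,{field}", "wt": "json"}
    return base + "?" + urlencode(params)


def extract_text(response, field="content"):
    """Map each document id to its stored text (empty string if not stored)."""
    texts = {}
    for doc in response["response"]["docs"]:
        value = doc.get(field, "")
        # A multiValued field comes back as a list of strings.
        texts[doc.get("id")] = "\n".join(value) if isinstance(value, list) else value
    return texts


def fetch_text(keyword="*"):
    """Query the running Solr instance; needs network access to it."""
    with urlopen(build_select_url(keyword)) as resp:
        return extract_text(json.load(resp))
```

Calling fetch_text() against a running instance should return a dictionary keyed by document id. An empty string for a document suggests the field was indexed but not stored.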

If the content field is not defined, add a field definition to the schema file.

For example, the field declaration:

<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>

The field type definition:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
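If the collection uses a managed schema rather than a hand-edited schema file, the same field could probably be added through Solr's Schema API instead. The sketch below is a hedged Python illustration: the schema URL is assumed from the collection name in the question, and the helper names (add_field_command, post_schema_command) are hypothetical. Documents indexed before the field existed would still need to be re-indexed to have their content stored.

```python
import json
from urllib.request import Request, urlopen

# Assumed from the collection name in the question.
SCHEMA_URL = "http://localhost:8983/solr/gettingstarted/schema"


def add_field_command(name="content", field_type="text_general"):
    """Build a Schema API add-field command mirroring the XML declaration."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": True,
            "stored": True,
            "multiValued": True,
        }
    }


def post_schema_command(command, url=SCHEMA_URL):
    """POST the command to Solr; only applies when the schema is managed."""
    request = Request(
        url,
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as resp:
        return json.load(resp)
```

A call such as post_schema_command(add_field_command()) would send the request. The field type text_general is assumed to exist already, as it does in the definition above.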