Solr schema regex transformer

Time: 2017-06-15 10:02:24

Tags: indexing solr

I'm trying to perform internal transformations of fields defined in my Solr schema.

I have these 2 entries in my schema.xml:

<field name="source_file" type="string" indexed="true" stored="true" docValues="true"/>
<copyField source="source_file_extraction" dest="text"/>

The field source_file contains the basename of my indexed documents (for example, 1234_helloworld.pdf). I'd like to use a regex to extract some data from this basename (for example, extract all the digits with \d*, giving 1234) and save the result in the field source_file_extraction.

To do that, I've seen that it should be possible to use regex transformers. I configured the file solr-data-config.xml as follows:

<dataConfig>
  <document>
    <entity name="source_file_extraction" transformer="RegexTransformer" query="select coll from source_file_extraction">
        <field column="coll" regex=".*?-(\d\d)-.*" sourceColName="source_file"/>
    </entity>
  </document>
</dataConfig>

And I added a requestHandler to the file solrconfig.xml:

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">solr-data-config.xml</str>
  </lst>
</requestHandler>

But it does not work.

How can I apply a simple regex transformation to a field defined in the schema and save the result in another field of the same schema?

Thanks in advance for your help.

1 answer:

Answer 0 (score: 1)

Use the solr.PatternReplaceFilterFactory filter factory for the field "source_file_extraction".

Update your schema file as follows:
<field name="source_file_extraction" type="NameExtractor" indexed="true" stored="true"/>

<fieldType name="NameExtractor" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])" replacement="" replace="all"/>
   </analyzer>
</fieldType>

Add a copy field from source_file to source_file_extraction:

<copyField source="source_file" dest="source_file_extraction"/>

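When the value is copied into the field source_file_extraction, the analyzer applies the filter and keeps only the digit characters of that value in the index. (Analysis applies to the indexed term only; the stored value returned in search results is still the original input.)

It does not modify the value of the source_file field.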
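To make the effect of the pattern concrete, here is a minimal sketch that reproduces the replacement outside Solr. It uses Python's re module purely for illustration; Solr itself uses Java regular expressions, but for a simple character class like [^0-9] the behavior is the same. The helper name keep_digits is hypothetical and not part of Solr.

```python
import re

# Same pattern as in the fieldType above: any single non-digit character.
NON_DIGIT = re.compile(r"([^0-9])")


def keep_digits(value: str) -> str:
    """Remove every non-digit character, mimicking replace="all" with replacement=""."""
    return NON_DIGIT.sub("", value)


# Example basename taken from the question.
source_file = "1234_helloworld.pdf"
source_file_extraction = keep_digits(source_file)

print(source_file_extraction)  # 1234
print(source_file)             # 1234_helloworld.pdf (the source value is untouched)
```

Note that the pattern removes non-digits everywhere in the value, so digits that appear later in the basename are concatenated with the leading ones: keep_digits("1234_v2.pdf") gives "12342". If only the leading number is wanted, a more specific pattern would be needed.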

Don't forget to restart Solr after modifying the schema.

Hope this helps, Vinod