How to remove <a> tag content from the content field before indexing/storing in Solr

Time: 2017-08-15 09:09:11

Tags: html solr

I index some HTML documents with Solr 6.6.0. There is a lot of link text in the content field, which dilutes the search results. So, how do I remove the <a> tag content from the "content" field before indexing/storing in Solr? Is there a way to do this with an updateRequestProcessorChain? Does anybody know a solution?

2 answers:

Answer 0 (score: 0)

At index time, use HTMLStripCharFilterFactory as a char filter in the analyzer of the field type definition.

This char filter strips HTML markup from the input stream:

<analyzer>
 <charFilter class="solr.HTMLStripCharFilterFactory"/>
 <tokenizer ...>
 [...]
</analyzer>
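
For context, a fuller field type might look like the sketch below. The type name `text_html` and the choice of StandardTokenizerFactory and LowerCaseFilterFactory are assumptions for illustration, not part of the answer. Note that a char filter only affects the indexed tokens, so the stored value still holds the original HTML. As far as I know, HTMLStripCharFilterFactory removes the markup but keeps the text between <a> and </a>, so on its own it may not get rid of the link text.

```xml
<!-- Hypothetical field type; name, tokenizer and lowercase filter are illustrative choices -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer and see the raw field value -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- "content" is the field named in the question -->
<field name="content" type="text_html" indexed="true" stored="true"/>
```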

Answer 1 (score: 0)

I solved the problem by adding hidden marker divs before and after the text:

<div style="display:none">1%%A</div>
   TEXT TEXT TEXT
<div style="display:none">1%%E</div>

and by adding this to solrconfig.xml:

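<updateRequestProcessorChain name="myregex">
   <!-- Replace everything from the 1%%A marker to the 1%%E marker (markers included) with a space.
        In Java regex "." does not match line breaks unless the pattern starts with (?s). -->
   <processor class="solr.RegexReplaceProcessorFactory">
       <str name="fieldName">mytextfield</str>
       <str name="pattern">([1]{1}%{2}[A]{1})(.*)([1]{1}%{2}[E]{1})</str>
       <str name="replacement"> </str>
       <bool name="literalReplacement">true</bool>
   </processor>
   <!-- A custom chain needs RunUpdateProcessorFactory last, otherwise documents are not indexed. -->
   <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>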
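
A variation that avoids the marker divs, as a sketch only: assuming the raw HTML really arrives in the `content` field (the field name from the question), the same processor can match the anchor elements directly. The chain name `strip-links` and the pattern are my assumptions. The pattern is a plain regex, not an HTML parser, so unusual markup may slip through. `(?is)` makes matching case-insensitive and lets `.` span line breaks, and `.*?` stops at the first closing `</a>`.

```xml
<updateRequestProcessorChain name="strip-links">
  <!-- Hypothetical chain: drop every <a ...>...</a> element, link text included -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <!-- "<" must be written as &lt; inside solrconfig.xml -->
    <str name="pattern">(?is)&lt;a\b[^>]*>.*?&lt;/a\s*></str>
    <str name="replacement"> </str>
    <bool name="literalReplacement">true</bool>
  </processor>
  <!-- Needed last so the modified document is actually written to the index -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

A chain is only applied when it is selected, e.g. with the request parameter `update.chain=strip-links` or by marking the chain `default="true"`.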

It works for me.