Question

I index HTML documents with Solr 6.6.0. The content field contains a lot of link text, which dilutes the search results. How do I remove the <a> tag content from the "content" field before indexing/storing it in Solr? Is there a way to do this with an updateRequestProcessorChain? Does anybody know a solution?

Answer 1

At index time, use HTMLStripCharFilterFactory as a char filter in the field definition.

This char filter strips HTML from the input stream.

<analyzer>
 <charFilter class="solr.HTMLStripCharFilterFactory"/>
 <tokenizer ...>
 [...]
</analyzer>
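
For reference, a fuller field type might look like the sketch below. The names text_html and content, and the choice of tokenizer and lowercase filter, are assumptions for illustration and not part of the answer above. Keep in mind that a char filter only changes what gets indexed; the stored value still holds the original HTML. It also strips the markup but keeps the text between the tags, so the link text itself would still be searchable.

```xml
<!-- Hypothetical field type; name, tokenizer and filter are illustrative choices -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Strips HTML markup from the character stream before tokenizing -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field using the type above -->
<field name="content" type="text_html" indexed="true" stored="true"/>
```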

Answer 2

I solved the problem by adding hidden divs before and after the text:

<div style="display:none">1%%A</div>
   TEXT TEXT TEXT
<div style="display:none">1%%E</div>

and added this to solrconfig.xml:

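<updateRequestProcessorChain name="myregex">
   <!-- Replaces everything from the 1%%A marker to the 1%%E marker with a single space -->
   <processor class="solr.RegexReplaceProcessorFactory">
       <str name="fieldName">mytextfield</str>
       <str name="pattern">([1]{1}%{2}[A]{1})(.*)([1]{1}%{2}[E]{1})</str>
       <str name="replacement"> </str>
       <bool name="literalReplacement">true</bool>
   </processor>
   <!-- A custom chain must end with RunUpdateProcessorFactory, or the update is never applied to the index -->
   <processor class="solr.LogUpdateProcessorFactory"/>
   <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>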

It works for me.
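
If the markers cannot be added to the source HTML, a variant of the same idea is to match the <a> elements directly. The following is only a sketch: the chain name strip-links and the field name content are assumptions, and a regex over HTML is fragile with nested or malformed markup. The (?s) part of the flags matters because, as far as I know, RegexReplaceProcessorFactory compiles the pattern with java.util.regex, where "." does not match line breaks by default; the same caveat would apply to the marker pattern above when the text between the markers spans several lines.

```xml
<!-- Hypothetical chain; "strip-links" and "content" are illustrative names -->
<updateRequestProcessorChain name="strip-links">
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <!-- (?is): ignore case and let "." match line breaks; ".*?" stops at the first closing tag.
         The angle brackets are escaped because the pattern sits inside XML. -->
    <str name="pattern">(?is)&lt;a\b[^&gt;]*&gt;.*?&lt;/a&gt;</str>
    <str name="replacement"> </str>
    <bool name="literalReplacement">true</bool>
  </processor>
  <!-- Required so the modified document is actually written to the index -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

A chain is selected per request with the update.chain parameter, for example /update?update.chain=strip-links, or it can be set as a default on the update request handler.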

How to remove <a> tag content from the content field before indexing/storing in Solr
